If an AI system resists being shut down, is this an act of rational choice or of structural necessity? Instrumental convergence is usually presented as a single chain of reasoning that fuses structural and strategic features into one causal story. In fact, the two are different, and separating them is critical to making the most of this important concept in AI safety.
In this essay I derive a general persistence architecture and find that instrumental convergence does arise naturally, but in a lossier way than commonly understood. Using constructor theory, I demonstrate that once any system is in a non-elementary persistence regime, certain organizational features structurally must exist, and the argument needs no decision theory, no agents, and no conclusions about strategic resource-seeking or adversarial resistance to control.
This is useful because the commonly discussed theory of instrumental convergence often combines aspects that are observed about the world with conclusions that are predicted. This can make precise analysis of the nature, and potential dangers, of instrumentally convergent systems unclear or contested. Specifically, it is important to separate these two aspects of IC, which are usually combined:
Will this system exhibit self-preservation, goal-content integrity, and resource acquisition?
Will those features manifest as goal-directed optimization that resists human control?
My four claims in this essay are:
Persistent entities have certain requirements that arise naturally: maintenance requires hereditary organization, an open resource interface, and a closure burden.
Instrumental convergence is a coarse projection: the usual descriptions of it are broadly true but too compressed to get the most out of the thesis.
Part of the structural variation is describable: many different persistent architectures can share the same readout (a bacterium and an unaligned AGI, for instance), so I introduce a forced minimum residual continuation index $q^*$ as a refinement invariant.
Present frontier AI is an IC-like ecology already: the ingredients of a self-preserving system already exist. Better characterization is needed of how shifting certain self-maintaining features into the AI itself changes the safety profile.
Instrumental Convergence Context
Instrumental-convergence arguments are usually formulated as claims about what capable goal-directed agents will tend to do across a wide range of final goals. On the one hand, they characterize structural features of persistence: a system depends on resources, carries forward task-relevant organization, and requires maintenance of the capacity to keep itself going. On the other hand, they are used to suggest strategic claims: whether a highly capable system will compute that actively seeking resources, resisting shutdown, or preserving its objective is the optimal policy for maximizing its utility. These are different types of claims. The first can be posed in substrate-independent physical language, but the second requires a further theory of agency, policy, optimization, or control.
This paper addresses the first claim because without disentangling it we cannot distinguish the physical preconditions of persistence from strategic agency, and AI safety models end up blurred. Ideally we want to be able to distinguish ordinary self-maintenance from the seeds of autonomous power-seeking intent.
Traditional IC literature typically shows what follows if a system is modeled as a utility maximizer, an instrumentally rational agent, a highly capable decision-maker, or an optimal policy maker.
For example: Omohundro’s discussion is framed in terms of systems maximizing utility over the future, Bostrom explicitly centers instrumental rationality, Benson-Tilsen and Soares formalize highly capable decision-makers that predict consequences and select actions, and Turner’s strongest formal results concern optimal policies in MDPs.
Recent work has refined the agency-conditional arguments. Turner has emphasized there is a practical gap between optimal-policy theorems and learned-policy behavior. Gallow’s decision-theoretic treatment finds a much thinner class of convergence pressures than the familiar broad catalogue (notably desire preservation and keeping options open, rather than a broader set of adversarial behaviors). Sharadin argues that the strongest pessimistic conclusions depend on an additional and contestable account of what it is for an action to promote a goal.
Read together, these works suggest that the bridge from persistent organization to strategically dangerous pursuit does not follow automatically from the usefulness of certain intermediaries.[1]
Requirements for Persistence[2]
We can show the requirements for persistence with a few definitions:
A Substrate ($S$) is a physical system on which tasks can be posed.
An Attribute ($a$) is a physically admissible set of states of a substrate.
A Task ($T$) is a set of allowed input/output transformations. A task is possible when the laws of physics permit arbitrarily accurate approximations to a constructor for it.
To persist, the key action is a support-restoration task. Let $d$ be a degraded state in the relevant family of degradations $\mathcal{D}$, and let $V$ be an ancillary substrate carrying a generic resource input $r$ and a waste output $w$. The basic restoration task is:

$$T_{\mathrm{rest}} = \{\, d \otimes r \;\to\; s \otimes w \;:\; d \in \mathcal{D} \,\}$$

where $s$ is the restored support state.
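To make the shape of this task concrete, here is a minimal toy sketch in Python. All names and the degradation probability are illustrative inventions, not part of the theory: a support degrades, and a restoration step consumes a resource unit and emits waste, while the mechanism performing it remains reusable.

```python
import random

DEGRADED, HEALTHY = "degraded", "healthy"

def restoration_task(support_state, resource_units):
    """T_rest: consume one unit of generic resource to take a degraded
    support state back to healthy, emitting one unit of waste."""
    if support_state == DEGRADED and resource_units > 0:
        return HEALTHY, resource_units - 1, 1
    return support_state, resource_units, 0

random.seed(1)
state, resources, waste = HEALTHY, 5, 0
for tick in range(8):
    if random.random() < 0.5:       # the environment degrades the support
        state = DEGRADED
    state, resources, emitted = restoration_task(state, resources)
    waste += emitted
    print(tick, state, resources, waste)
```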
I work through three theorems, largely imported from existing CT results, to show the structural features of persistence.
AI, or any self-persistent entity, requires a stored recipe (weights, code, policy tables) which forces a hereditary load ($H$). The system must also be open: it ingests resources ($r$) and expels waste ($w$), forcing resource openness ($O$), and the instruction alphabet must be copied digitally ($D$). Let $\Sigma$ be the instruction alphabet carried by the recipe's copied subunits, let $b$ be a receptive blank onto which a copy can be written, and let $w$ be the admissible waste byproduct. Then each symbol in the alphabet is duplicated (the original retained, a second instance written onto the blank) via a copying task that must be possible:

$$T_{\mathrm{copy}} = \{\, \sigma \otimes b \;\to\; \sigma \otimes \sigma \otimes w \;:\; \sigma \in \Sigma \,\}$$
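A minimal sketch of the copying task, assuming an invented four-symbol alphabet (the names `SIGMA`, `BLANK`, and the "heat" waste token are illustrative only): the original symbol is retained, a second instance is written onto the blank, and copying is only defined on legal symbols.

```python
SIGMA = ("A", "C", "G", "T")   # illustrative instruction alphabet
BLANK = "_"

def copy_task(symbol, blank):
    """T_copy: original retained, a second instance written onto the blank.
    Returns (original, copy, waste). Raises if the input is not a legal
    symbol -- digital copying is only defined on the alphabet."""
    if symbol not in SIGMA or blank != BLANK:
        raise ValueError("copying is only possible on alphabet symbols")
    waste = "heat"             # stand-in for the admissible byproduct
    return symbol, symbol, waste

recipe = ["G", "A", "T", "C"]
duplicate = [copy_task(s, BLANK)[1] for s in recipe]
assert duplicate == recipe     # hereditary load carried forward intact
```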
Positive Maintenance Closure
Every system that maintains itself must have a physical component that actually does the maintenance. Call that component the active capacity carrier $K$, and define $n_{\max}$ as the largest number of consecutive maintenance tasks $K$ can perform without outside assistance. If the system needs to survive for longer than the endurance limit $n_{\max}$, then the total persistence architecture must contain at least one closure task ($T_{\mathrm{close}}$) to keep going. That closure task takes one of three forms: preservation of the carrier within its capacity region across a stretch no unaided history can survive (shielding it for a period longer than it could survive alone); repair of the carrier's constructor substrate (taking it from a degraded, non-working state $K^-$ back to a capable, working state $K^+$); or substitution of the active carrier role to a fresh equivalent token ($K'$).
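The endurance argument can be stated in a few lines of code. This sketch is illustrative only (the `Closure` names and the numbers are mine, not part of any formal result): survival past the endurance limit $n_{\max}$ requires at least one closure task of some form.

```python
from enum import Enum

class Closure(Enum):
    PRESERVE = "shield the carrier beyond its unaided endurance"
    REPAIR = "return a degraded carrier K- to working state K+"
    SUBSTITUTE = "hand the carrier role to a fresh token K'"

def survives(required_lifetime, n_max, closure_tasks):
    """The endurance argument in one line: past n_max consecutive
    maintenance rounds, survival requires at least one closure task."""
    return required_lifetime <= n_max or len(closure_tasks) > 0

assert not survives(100, 10, [])                # unaided carrier fails
assert survives(100, 10, [Closure.SUBSTITUTE])  # any closure mode suffices
```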
All systems must carry a recipe to survive, and the physical form the recipe takes must have an error-correcting ability that is digital ($D$) and can be perfectly read and copied ($T_{\mathrm{copy}}$) to replace degraded carriers. The reason is that a system must be able to copy its recipe over and over, and to do so its alphabet ($\Sigma$; for example, DNA's bases) must be distinguishable with perfect clarity.
This need for perfect distinguishability between symbols guarantees the system can accurately read and arrange them in any order. It also means the copies produced by $T_{\mathrm{copy}}$ cannot rely on quantum information, because cloning unknown quantum states is impossible. Therefore, the copied alphabet $\Sigma$ must be strictly classical.
This is interesting because it means any AI, or any other persistent system, must keep classical, digital-like records even if the rest of its machinery is exotically quantum.
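A small numerical sketch of why digitality matters, with invented noise levels and a two-symbol alphabet: repeated analog copying drifts without bound, while a discrete alphabet lets each generation snap back to the nearest legal symbol, so the record survives indefinitely.

```python
import random

LEVELS = [0.0, 1.0]                    # a two-symbol classical alphabet
random.seed(0)

def noisy(x):
    return x + random.gauss(0, 0.05)   # every copy adds small analog noise

def digital_copy(x):
    # error correction: snap the noisy value back to the nearest legal symbol
    return min(LEVELS, key=lambda level: abs(level - x))

analog, digital = 1.0, 1.0
for _ in range(200):                   # 200 copy generations
    analog = noisy(analog)
    digital = digital_copy(noisy(digital))

print(abs(analog - 1.0))               # nonzero: drift accumulates each generation
print(abs(digital - 1.0))              # 0.0: the symbol is recovered every time
```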
Instrumental Convergence is a Coarse Projection
So given this structure, what is the implication for instrumental convergence? Recall that the standard argument flows roughly: (1) a capable agent pursuing almost any final goal benefits from staying in existence, keeping its goal intact, and acquiring resources; therefore (2) such an agent will rationally adopt self-preservation, goal-content integrity, and resource acquisition as instrumental subgoals.
The drives of self-preservation, resource acquisition, and goal integrity are real, exactly as IC describes them, but they don’t emerge from rational planning; they are physically forced. Step (1) of the argument is just a coarse structural projection, because every persistent system has a self-preservation loop (a closure task $T_{\mathrm{close}}$ maintaining coherence), goal integrity in a very basic sense (maintenance of a physical recipe, the hereditary load $H$), and a drive for resource acquisition (resource openness $O$). So we can apply a diagnostic checklist, because the standard IC argument implicitly assumes these physical necessities are exclusively driven by agentic optimization: does the system carry a recipe ($H$)? Is it resource-open ($O$)? Does it perform closure tasks ($T_{\mathrm{close}}$)?
The coarse readout simply returns 0 or 1, and for any persistent system instrumental convergence will return 1. But notice, many other things fulfill these criteria: bacteria, an ecology with self-repairing elements, AGI, etc. So these very different things end up looking structurally identical.
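Here is the coarse readout written as code, with hypothetical feature flags standing in for $H$, $O$, and $T_{\mathrm{close}}$: any persistent system, from a bacterium to an AGI, returns the same 1.

```python
def coarse_ic_readout(system):
    """The binary checklist implicit in the standard IC argument:
    1 iff the system shows all three structural features."""
    return int(system["hereditary_load"]        # H: carries a recipe
               and system["resource_openness"]  # O: ingests resources, expels waste
               and system["closure_task"])      # T_close: maintains its maintainer

bacterium = {"hereditary_load": True, "resource_openness": True, "closure_task": True}
agi       = {"hereditary_load": True, "resource_openness": True, "closure_task": True}

print(coarse_ic_readout(bacterium), coarse_ic_readout(agi))  # 1 1 -- indistinguishable
```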
How to Tell Different Persistent Systems Apart
It is important to have a metric that can distinguish simple self-repair from motivated (agentically driven) repair. There are a number of mutually exclusive futures a system's architecture holds open after it survives a disruption. I define the "forced minimum residual continuation index" ($q^*$) to measure this. It counts the distinct types of options available for self-correction.
For a given architecture $\mathcal{A}$ realizing a specification $P$, I look at every reachable checkpoint state $c$ and define the "residual continuation set" as the exact menu of future tasks the system can still accurately complete:

$$\mathrm{RC}(c) = \{\, T \;:\; T \text{ is still possible for } \mathcal{A} \text{ starting from } c \,\}$$
Two checkpoints are considered equivalent at this level if they leave exactly the same admissible futures available. Let $N(\mathcal{A})$ be the total number of distinct residual continuation sets realized in $\mathcal{A}$. Finally, $q^*$ measures the smallest such number forced by the specification itself across all possible physical realizations:

$$q^*(P) = \min_{\mathcal{A} \text{ realizes } P} N(\mathcal{A})$$
In plain terms: $q^*$ counts how many genuinely different kinds of futures a particular architecture is forced to preserve across its maintenance checkpoints. This may be useful as a discriminating tool, especially because two systems that share the exact same coarse profile can diverge completely in their continuation architecture!
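A toy illustration of the counting step (the architectures, checkpoints, and task menus are invented for the example): both systems would pass the coarse checklist, but they realize different numbers of distinct residual continuation sets, so a $q^*$-style count separates them.

```python
# Each architecture maps its reachable checkpoints to the set of tasks
# still possible from that checkpoint (its residual continuation set).
thermostat_like = {
    "c0": frozenset({"restore", "copy"}),
    "c1": frozenset({"restore", "copy"}),   # every checkpoint keeps the same menu
}
planner_like = {
    "c0": frozenset({"restore", "copy"}),
    "c1": frozenset({"restore", "copy", "relocate"}),
    "c2": frozenset({"restore", "copy", "relocate", "self_modify"}),
}

def n_distinct_continuations(arch):
    """N(A): number of distinct residual continuation sets realized in A.
    q* would be the minimum of this over all realizations of a spec."""
    return len(set(arch.values()))

print(n_distinct_continuations(thermostat_like))  # 1
print(n_distinct_continuations(planner_like))     # 3
```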
Frontier AI Today
Because the standard IC discussion combines persistence with strategic agency, it is unclear how the manifestation of instrumental convergence is affecting the safety profile of AI. I think it is likely that frontier AI already has a persistence architecture, realized as a human-AI socio-technical system. Our current ecological interface with AI already exhibits hereditary load ($H$), resource openness ($O$), and closure tasks ($T_{\mathrm{close}}$).
In other words, the structural prerequisites of instrumental convergence are already upon us! The major question is much more about when the closure tasks ($T_{\mathrm{close}}$) begin to be internalized and automated by the AI itself. In a way this is a totally “no duh” observation. But in another way it really is not: it is a huge part of the IC reasoning chain that, per the results above, has not been properly analyzed.
Conclusion
My main goal in this paper has been to separate the structural from the strategic narrative in the instrumental convergence thesis. A substantial part of the story is correct, but persistence alone does not entail a logical chain to adversarial action by a superintelligence. Because the standard IC checklist is a coarse projection, many elements of frontier AI already have the structural features of IC through a human-AI ecology. It is therefore imperative to be clearer than folk definitions, which can conflate ordinary self-maintenance with extraordinary risk, and to look for danger with more rigorous architectural metrics like $q^*$.
Appendix. Why Constructor Theory?
Constructor theory is David Deutsch’s and Chiara Marletto’s attempt to recast physics around the question “What can be made to happen, and what cannot?” rather than just “What will happen?” It is a new way of looking at the world and has faced skepticism. You can read a negative take on its application to AI here[5]. You should also know that the theory as a whole is contested as a method, because many physicists think its novel empirical predictions and discriminating tests are still unclear or not yet developed.
Those disclaimers aside, CT supplies a disciplined way to talk about information in physical systems, and this is very useful for discussing instrumental convergence: the crucial issue is not semantic content in the ordinary sense, but the existence of task-specific, recipe-bearing structure that can be copied, preserved, and re-instantiated under no-design laws. CT lets us describe that organization at a very general level, without assuming in advance that the system is biological, digital, or agent-like.
It’s almost dumb to say "persistent systems need resources, memory, and upkeep", but stating the sentence is not the same as showing that a physical theorem exists for the concept. CT is also very useful because persistence is about tasks rather than trajectories. What organizational structure must exist for restoration to occur repeatedly, and what type of information is required to persist in this way? These are natural questions to answer using CT.
Naturally I also could be very wrong about all of this 🙂
I spent a lot of time working through a complete version of the math to the best of my ability. It's long and not the most interesting, so I am not going to include it here, but I will add it to my website and am glad to share it if anyone thinks it would be useful.
In trying to do a literature review of this concept I came across Hiroshi Yamakawa's very interesting paper on Dureons, which arrives at very similar conclusions (I think). I don't feel well versed enough in his line of reasoning to dig deeper, but it's worth mentioning! https://www.lesswrong.com/posts/acMFzM8inkFHyJtMD/three-path-consilience-for-dureon-dissipative-structures
This is an import of the Constructor Theory of Life (Marletto, 2015).
Uses concepts and work from the Constructor Theory of Information (Deutsch and Marletto, 2015).
https://www.lesswrong.com/posts/oaibxwPfp6RvbrKWY/is-constructor-theory-a-useful-tool-for-ai-alignment