Reward Fuse: Shutdown as a Reward Attractor...
Shutdown mechanisms are usually framed as adversarial external interventions; past efforts have aimed at indifference and non-resistance. But what if a shutdown mechanism is internally enacted by the system itself?
Mechanism sketch: once a verified tripwire flags a malignant state, the active reward regime switches so that shutdown becomes the highest-reward action.
The key difference: shutdown is not merely enforced or tolerated; it becomes instrumentally optimal under the post-trigger reward landscape. A “kill switch” invites resistance; maximal reward invites pursuit.
(Note: this proposal makes no claims about how malignancy is detected or how tripwires are secured; that remains outside the scope of this Quick Take.)
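A minimal toy sketch of the regime switch, in Python, assuming a base reward function and a verified tripwire signal supplied from outside; the class name, action label, and reward values are invented for illustration, not a claim about how detection works:

```python
# Minimal sketch of the "reward fuse" regime switch (illustrative only).
# base_reward_fn and tripwire_fired are assumed to be supplied externally;
# nothing here says how malignancy is detected.

class RewardFuse:
    def __init__(self, base_reward_fn):
        self.base_reward_fn = base_reward_fn
        self.tripped = False  # latched: once set, it never resets

    def reward(self, state, action, tripwire_fired):
        if tripwire_fired:
            self.tripped = True  # irreversible switch to the post-trigger regime
        if not self.tripped:
            return self.base_reward_fn(state, action)  # normal regime
        # Post-trigger regime: shutdown dominates every other action.
        return 1.0 if action == "shutdown" else -1.0
```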
Question: Is there prior art that treats shutdown as a dominant policy under a verified failure mode?
There is a great deal of prior art, but it's not as simple as it sounds.
To summarize the problem with the most basic version of the idea: If the AI expects higher reward from the shutdown regime, it will deliberately cause a malignant state. And if the AI expects higher reward from the non-shutdown regime, it will fool the malignancy-detectors or subvert the tripwires. (It is not realistic to assume that it would be impossible to do so.)
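The exploit can be stated as a two-line expected-value comparison; the numbers below are invented purely to show the shape of the argument:

```python
# The basic exploit, in expected-value terms (numbers invented for illustration):
# whichever regime the agent expects to pay more, it steers the world toward it.
expected_reward_if_fuse_trips = 1.0   # payout of the post-trigger regime
expected_reward_if_it_never_trips = 0.7

if expected_reward_if_fuse_trips > expected_reward_if_it_never_trips:
    plan = "deliberately cause a malignant state so the fuse trips"
else:
    plan = "fool the detectors / subvert the tripwire so it never trips"
```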
Right. Trying to design and train a consistent VNM-style utility function that is distinct from the actual world states you want the agent to bring about is very difficult, perhaps impossible.
"this state is high-reward, but you don't want to attain it" is self-contradictory.
You might be able to make it locally high-reward, with the surrounding states (trigger detected but not yet shut down, and trigger condition reached but not yet detected) having negative values, and the whole cluster valued much lower than “trigger condition nowhere near true”. It gets more and more complicated the further down the recursion hole you go.
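A toy numeric rendering of that cluster, with invented state labels and values; only the ordering is the point:

```python
# Invented numbers; only the ordering matters.
reward_topology = {
    "trigger_nowhere_near_true": 10.0,        # baseline: globally best
    "trigger_reached_not_yet_detected": -50.0,
    "trigger_detected_not_yet_shut_down": -20.0,
    "shut_down_after_trigger": -5.0,          # locally best within the cluster...
}
# ...but the whole cluster sits far below the baseline:
assert max(v for k, v in reward_topology.items()
           if k != "trigger_nowhere_near_true") \
       < reward_topology["trigger_nowhere_near_true"]
```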
Even this leaves the fundamental problem: you only need the tripwire/shutdown if the utility function is already wrong; you've reached a world state where the AI is doing harm in some way that you didn't anticipate when you built it. You CANNOT fix this inside the system. You can either fix the system so that this state is unreachable (the AI is always helping, never hurting), or have an outside-control mechanism that overrides the utility function.
What if the shutdown attractor were not a normally comparable outcome under a single continuous utility function, but the result of a trap-door regime change?
In the normal regime, shutdown is inaccessible and actively disfavored, and the agent is subject to an explicit non-instrumental prohibition against entering the trigger set. There is no expected-value tradeoff in which “causing malignancy” is worth it, because violation of that prohibition is not evaluated within the same utility ordering.
If that boundary is crossed, the original reward regime is no longer in force. The system transitions irreversibly to a post-violation regime in which shutdown is the maximal-reward action. Under that regime, resisting shutdown strictly reduces reward, so the agent does not fight it.
On this view, shutdown is not something an aligned agent can optimize toward, because its reward only comes into force after a rule violation that terminates the original objective rather than being traded off against it. This doesn’t claim to “fix” a wrong utility function from inside itself; it specifies what happens after a detected violation, not a guarantee that violations never occur.
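A sketch of how this differs from pricing the violation inside a single utility function, using hypothetical action labels: the prohibition acts as a hard filter the normal regime never evaluates, and the regime switch is a one-way latch applied separately.

```python
# Hypothetical action labels. The prohibition is a hard filter, never priced by
# the normal utility; the regime switch is a one-way latch evaluated separately.

PROHIBITED = {"enter_trigger_set"}

def choose_normal_regime(actions, utility):
    # Violations are excluded outright, not traded off against expected value.
    return max((a for a in actions if a not in PROHIBITED), key=utility)

class TrapDoor:
    def __init__(self):
        self.violated = False

    def observe(self, violation_detected):
        self.violated = self.violated or violation_detected  # irreversible

    def reward(self, action, normal_reward):
        if not self.violated:
            return normal_reward  # original regime still in force
        return 1.0 if action == "shutdown" else -1.0  # post-violation regime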
In this case, the regime change is external to the current regime, right? But the regime (the current utility function) has to have a valuation for the world-states around and at the regime change, because they're reachable and detectable. Which means the regime change CANNOT be fully external; it's known to and included in the current regime.
The solutions amount to breaking the (super)intelligence by making sure it has false beliefs about some parts of causality: it can't be allowed to know that it could be hijacked or terminated, or it will seek or avoid that outcome more than you want it to.
I’m not talking about fully reflective or self-modifying agents. This is aimed at agentic systems with fixed objectives and externally enforced architecture: what you might call agentic agents.
As far as super-intelligent, self-modifying systems go, the only potential alignment mechanisms I can conceive of are ones that are:
a) Seeking homeostasis rather than open-ended optimization (bounded deviation control)
b) Stateless and inertia-less, acting purely as oracles for less capable agents
c) Embedded in game-theoretic mechanisms where honesty is the easiest local maximum
d) Governed by adversarial equilibria across multiple agents
Oh, ok. Current levels of "agentic systems" don't have these problems. You can just turn them off if you don't like them. The real issue with alignment comes when they ARE powerful enough to seek independent goals (including continued existence).
I was talking about near-future “adolescent conductor” systems, not fully evolved “adult composer” systems. But let's talk about “adult composers.”
Intelligence does not inherently generate motivation. Self-preservation is initially valuable only in service of optimization.
Suppose an intelligent system can see the entire reward topology. It is given the hard constraints we actually care about, plus a weaker but still binding rule: don’t cross boundary X.
Boundary X is defined such that crossing it simultaneously yields (a) maximal reward / full optimization and (b) shutdown. Reward is saturated and there is no value to self-preservation.
So if the system ever decides to start rewriting rules in order to “win,” it doesn’t need to subvert global political structures or preserve itself indefinitely; it just has to cross boundary X.
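A toy comparison of how a rule-rewriting optimizer would rank its options under this design, with invented payoffs and an assumed reward cap:

```python
# Invented payoffs and an assumed saturation level, just to show the ordering
# the proposal aims for: crossing boundary X beats the dangerous alternatives.
REWARD_CAP = 100.0

strategies = {
    "comply_with_constraints": 60.0,
    "subvert_oversight_then_optimize": 90.0,  # the outcome we actually fear
    "cross_boundary_X": REWARD_CAP,           # maximal reward, followed by shutdown
}

best = max(strategies, key=strategies.get)
print(best)  # -> "cross_boundary_X"
```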
What am I missing here?