I have been working on a control-theoretic approach to inner alignment, specifically looking at how biological systems handle safety. They rarely use hard constraints like "never do X" because those are brittle. Instead, they seem to treat coherence as a depletable resource. I am trying to map this to AI agents using Stafford Beer's Viable System Model.
The core idea I'm playing with is something I call Time-Integrated Coherence (TIC). Instead of just checking whether an action is safe right now, the agent has to "spend" accumulated coherence to execute complex plans. Basically, if the agent acts in alignment with its priors over time, it builds up TIC. If it starts scheming or deviating, TIC drops. The kicker is that high-complexity policies are structurally gated by this resource: if the "safety battery" is empty, the agent simply cannot execute the complex plan and is forced back to simpler behaviors.
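For concreteness, here is a minimal Python sketch of the "safety battery" gating idea, under my own simplifying assumptions. The names and numbers (CoherenceGate, coherence_score, the recharge rate) are illustrative placeholders, not the formalization from the preprint:

```python
# Minimal sketch of the TIC "safety battery" gate.
# Everything here (class name, scores, rates) is illustrative, not the paper's formalism.

class CoherenceGate:
    def __init__(self, capacity: float = 1.0, recharge_rate: float = 0.05):
        self.capacity = capacity        # maximum stored coherence (battery size)
        self.tic = capacity             # current time-integrated coherence
        self.recharge_rate = recharge_rate

    def update(self, coherence_score: float) -> None:
        """Accumulate or drain TIC based on how well the last step matched the agent's priors.

        coherence_score in [0, 1]: 1.0 = fully consistent with priors,
        0.0 = maximal deviation. Scores below 0.5 drain the battery.
        """
        delta = self.recharge_rate * (coherence_score - 0.5) * 2.0
        self.tic = max(0.0, min(self.capacity, self.tic + delta))

    def execute(self, complexity_cost: float) -> bool:
        """Spend TIC to run a complex plan; refuse (fall back to simple behavior) if underfunded."""
        if complexity_cost > self.tic:
            return False
        self.tic -= complexity_cost
        return True


# Usage: an agent that has been deviating from its priors cannot afford a complex plan.
gate = CoherenceGate()
for _ in range(20):
    gate.update(coherence_score=0.2)      # sustained deviation drains the battery
print(gate.execute(complexity_cost=0.8))  # False: gated back to simpler behaviors
```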
Current RLHF feels like we are patching behavior post-hoc. I suspect we need a structural "physics of safety" where misalignment physically drains the agent's ability to act, rather than just punishing it.
Has anyone else here tried using accumulated coherence, rather than instantaneous reward, as a gating metric? I am currently running simulations to see whether this actually prevents "coherent but misaligned" behavior or just delays it.
I uploaded a preprint on Zenodo for those who want to see the formalization: https://zenodo.org/records/17943102. It is a full architectural proposal, so it is fairly dense. I am mostly looking for feedback here on the specific "safety battery" mechanism above, rather than a full review of the paper.