I have been exploring this question for quite some time and recently published a preprint proposing a nested VSM (Viable System Model) architecture for this purpose. Coming from a systems/cybernetics background, I honestly did not expect the traction it received, so I pulled together the courage to share it with the "hardcore alignment people" here on this platform.
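To make the structure concrete without asking anyone to open the preprint first, here is a deliberately toy sketch of what I mean by a nested VSM. `VSMNode` and the S1-S5 comments are my informal shorthand for Beer's five systems; the actual topology in the preprint is more involved than this.

```python
from dataclasses import dataclass, field
from typing import List

# Toy shorthand only: VSMNode and the S1-S5 comments are an informal gloss
# on Beer's five systems, not the formal topology from the preprint.

@dataclass
class VSMNode:
    """One recursion level of a nested Viable System Model."""
    name: str
    s1_units: List["VSMNode"] = field(default_factory=list)  # S1: operations

    def step(self, env_signal: float) -> float:
        # S1: each operational unit runs its own loop; nesting = recursion.
        outputs = [unit.step(env_signal) for unit in self.s1_units] or [env_signal]
        # S2: coordination, damping oscillation between units (here: averaging).
        coordinated = sum(outputs) / len(outputs)
        # S3: internal regulation, clipping to a viable operating band.
        regulated = max(-1.0, min(1.0, coordinated))
        # S4: outward/forward look (here: a crude environment estimate).
        outlook = 0.5 * env_signal
        # S5: identity/policy, balancing inside-and-now (S3) against
        # outside-and-future (S4).
        return 0.7 * regulated + 0.3 * outlook

# Two-level nest: the root's operational units are themselves viable systems.
root = VSMNode("org", s1_units=[VSMNode("unit-A"), VSMNode("unit-B")])
print(root.step(env_signal=0.4))
```

The essential point is the recursion: each S1 unit of a viable system is itself a full viable system, which is what I mean by "nested".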
My core hypothesis is that pure reward maximization can be replaced with what might be called viability maximization: roughly, the agent optimizes for staying within a viability region rather than for a cumulative reward signal. My research questions were simple; they are spelled out in the preprint, together with algorithmic sketches and experimental protocols, for those interested: https://zenodo.org/records/17943102
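To give a flavor of the objective shift before you open the link, here is a toy one-variable contrast between a reward maximizer and a viability maximizer. The dynamics, bounds, and planner below are placeholders I invented for this post, not the experimental protocols from the preprint.

```python
import random
from typing import Callable, List

# Placeholder dynamics, bounds, and planner, invented for this post; they
# are not the experimental protocols from the preprint.

VIABLE_LOW, VIABLE_HIGH = 0.2, 0.8  # viability band for one essential variable

def simulate(state: float, action: float, horizon: int = 20) -> List[float]:
    """Toy dynamics: the action nudges the state, noise perturbs it."""
    traj = []
    for _ in range(horizon):
        state += 0.1 * action + random.gauss(0.0, 0.05)
        traj.append(state)
    return traj

def reward_score(traj: List[float]) -> float:
    """Classic objective: cumulative reward (here, simply the state itself)."""
    return sum(traj)

def viability_score(traj: List[float]) -> float:
    """Proposed objective: time spent inside the viability band."""
    return sum(VIABLE_LOW <= s <= VIABLE_HIGH for s in traj)

def pick_action(state: float, score: Callable[[List[float]], float],
                n_rollouts: int = 200) -> float:
    """One-step sampling planner: pick the action with the best expected score."""
    actions = [a / 10 for a in range(-10, 11)]
    return max(actions, key=lambda a: sum(score(simulate(state, a))
                                          for _ in range(n_rollouts)))

print("reward maximizer picks:   ", pick_action(0.5, reward_score))
print("viability maximizer picks:", pick_action(0.5, viability_score))
```

The reward maximizer pushes the essential variable as hard as it can; the viability maximizer holds it inside the band, since leaving the band costs time-in-viability. That behavioral difference is the intuition the formal objective is meant to capture.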
I would appreciate critique of the formalization of the coherence metric and of the control topology.
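To give critics something concrete to push against, here is the weakest stand-in I can write for the coherence idea: mean pairwise cosine similarity between subsystem action proposals. It is emphatically not the metric formalized in the preprint; think of it as the baseline any serious coherence metric should beat.

```python
import math
from itertools import combinations
from typing import List

# Deliberate strawman baseline, not the formalization in the preprint:
# coherence as mean pairwise cosine similarity between the action proposals
# of the nested subsystems.

def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def coherence(proposals: List[List[float]]) -> float:
    """Mean pairwise agreement across subsystem proposals, in [-1, 1]."""
    pairs = list(combinations(range(len(proposals)), 2))
    if not pairs:
        return 1.0  # a lone subsystem is trivially coherent
    return sum(cosine(proposals[i], proposals[j]) for i, j in pairs) / len(pairs)

# Three subsystems proposing action vectors; two agree, one pulls sideways.
print(coherence([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))
```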