Can a Nested Viable Systems Architecture solve reward hacking?
I have been exploring this question for quite some time and recently published a preprint proposing a nested VSM architecture for this purpose. Coming from a systems/cybernetics background, I honestly did not expect the traction it received, so I gathered the courage to share it with the "hardcore alignment...