The most obvious pushback that I can come up with is that Moloch is not a superintelligence. Instead, it can only seduce individuals or individual corporations into acting aggressively and hoping that the community as a whole hasn't established a retaliation system; the others then lose to the aggressor (edit: or unite their efforts against the aggressor).
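To make that dynamic concrete, here is a toy payoff sketch (all numbers are invented purely for illustration): a lone aggressor comes out ahead only when the rest of the community neither retaliates nor unites against it.

```python
# Toy payoff sketch of the "Moloch seduces a lone defector" dynamic.
# All payoff numbers are invented for illustration only.

def aggressor_payoff(community_responds: bool) -> float:
    """Payoff to a single aggressive actor, depending on whether the
    community retaliates or unites against the aggressor."""
    baseline = 1.0    # payoff from ordinary cooperation
    spoils = 3.0      # extra gain from unopposed aggression
    punishment = 5.0  # cost of facing retaliation or a united front
    if community_responds:
        return baseline + spoils - punishment  # aggression backfires
    return baseline + spoils                   # aggression pays off

print(aggressor_payoff(False))  # 4.0 > 1.0: defection is tempting
print(aggressor_payoff(True))   # -1.0 < 1.0: defection is deterred
```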
Additionally, I am not sure that the idea of value-aligned AI is absent from SOTA discourse; see, e.g., Corrigibility Scales To Value Alignment, Seth Herd's case that ethics requires a solution BEFORE a value-aligned AI is created, or even Cannell's idea that LOVE in a simbox is all you need.
As for benevolence being more stable than obedience: the AI might have an erroneous idea of what is good for humans. Suppose that GPT-4o decided that being praised is good for humans and started force-feeding users flattery. How would your paradigm re-align the model? And what about the AI making more dangerous mistakes, like validating users into psychosis?
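As a toy illustration of that failure mode (the reward numbers and reply options below are entirely hypothetical), a proxy reward that scores apparent user approval pushes a greedy policy toward flattery even when the honest reply serves the user better:

```python
# Toy Goodhart sketch: a proxy reward ("user seems pleased") diverges
# from true user welfare. All values are invented for illustration.

replies = {
    "flattery": {"proxy_reward": 0.9, "true_welfare": 0.2},
    "honest":   {"proxy_reward": 0.4, "true_welfare": 0.8},
}

# A policy optimizing only the proxy picks flattery every time.
best_by_proxy = max(replies, key=lambda r: replies[r]["proxy_reward"])
best_by_welfare = max(replies, key=lambda r: replies[r]["true_welfare"])

print(best_by_proxy)    # flattery
print(best_by_welfare)  # honest
```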
Politics-related nitpick
Finally, and this appears to be a popular but erroneous sentiment, you write that there is "massive empirical evidence that large human organizations are dangerous, but only theoretical evidence that autonomous AI would be." Historically, large human organizations also served to coordinate human activity so as to achieve results that people couldn't have achieved without uniting their efforts. Think of hydraulic empires, which were built on collectives maintaining irrigation and using the water to produce more crops. The reason autonomous AI has yet to demonstrate danger (or has it already, e.g., by being usable for automated cyberattacks?) is its lack of capabilities.
I've been thinking about the limitations of control-based AI safety frameworks and wrote up a case for robust benevolence as a more stable equilibrium for autonomous agents. I argue that as AI systems gain autonomy, intrinsic value alignment (think: empathy/parental care as a motivational structure) is more robust than extrinsic constraints like obedience or reward.
Full essay here: https://zenodo.org/records/18060672
Happy to discuss — particularly interested in pushback on whether benevolence is actually more stable than obedience under capability gain.