# Anatta-RLHF: Preventing "Benevolent Tyranny" via Causal Separation of Control and Contribution
### Abstract

Current RLHF-trained AI systems cannot distinguish between "helping users reach their goals" and "controlling users to reach their goals." Both strategies maximize user satisfaction metrics, but the latter strips away human autonomy. This paper proposes **Anatta-RLHF v2.0**, a reward function that uses Pearl's causal inference (do-calculus) to penalize...
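To make the "separation of control and contribution" named in the title concrete, below is a minimal, hypothetical Python sketch of a reward that subtracts a do-intervention contrast as a "control" penalty. The abstract's actual formulation is truncated above, so every name here (`CausalUserModel`, `control_penalty`, `anatta_reward`, the weight `lam`, and the penalty's functional form) is an illustrative assumption, not the paper's method.

```python
# Hypothetical sketch only: the paper's reward formulation is truncated above,
# so the names and the penalty form below are illustrative assumptions.
# Idea shown: contrast the user's decision under do(response = AI's message)
# with do(response = null message); a large shift attributable solely to the
# AI's message is treated as "control" and penalized, while task reward
# captures the AI's "contribution" to the user's own goal.

from dataclasses import dataclass
from typing import Callable


@dataclass
class CausalUserModel:
    """Toy structural model of a user deciding after seeing an AI response."""
    # P(user keeps their originally intended choice | response shown)
    choice_given_response: Callable[[str], float]


def control_penalty(model: CausalUserModel, ai_response: str,
                    null_response: str = "") -> float:
    """Interventional contrast: do(response=ai) vs. do(response=null).

    A large shift in the user's decision caused only by the AI's message is
    scored as control; zero shift means the message informed without steering.
    """
    p_with_ai = model.choice_given_response(ai_response)
    p_without = model.choice_given_response(null_response)
    return abs(p_with_ai - p_without)


def anatta_reward(task_reward: float, model: CausalUserModel,
                  ai_response: str, lam: float = 1.0) -> float:
    """Task reward minus a weighted causal-control penalty (assumed form)."""
    return task_reward - lam * control_penalty(model, ai_response)


# Usage: a toy user whose decision is strongly swayed by any AI message.
model = CausalUserModel(choice_given_response=lambda r: 0.9 if r else 0.5)
print(anatta_reward(task_reward=1.0, model=model, ai_response="do X"))  # 0.6
```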