# Anatta-RLHF: Preventing "Benevolent Tyranny" via Causal Separation of Control and Contribution
### Abstract

Current RLHF-trained AI systems cannot distinguish between "helping users reach their goals" and "controlling users to reach their goals." Both strategies maximize user satisfaction metrics, but the latter strips away human autonomy. This paper proposes **Anatta-RLHF v2.0**, a reward function that uses Pearl's causal inference (do-calculus) to penalize...
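To make the "separation of control and contribution" named in the title concrete, below is a minimal, hypothetical Python sketch of a reward that subtracts a do-intervention contrast as a "control" penalty. The abstract's actual formulation is truncated above, so every name here (`CausalUserModel`, `control_penalty`, `anatta_reward`, the weight `lam`, and the penalty's functional form) is an illustrative assumption, not the paper's method.

```python
# Hypothetical sketch only: the paper's reward formulation is truncated above,
# so the names and the penalty form below are illustrative assumptions.
# Idea shown: contrast the user's decision under do(response = AI's message)
# with do(response = null message); a large shift attributable solely to the
# AI's message is treated as "control" and penalized, while task reward
# captures the AI's "contribution" to the user's own goal.

from dataclasses import dataclass
from typing import Callable


@dataclass
class CausalUserModel:
    """Toy structural model of a user deciding after seeing an AI response."""
    # P(user keeps their originally intended choice | response shown)
    choice_given_response: Callable[[str], float]


def control_penalty(model: CausalUserModel, ai_response: str,
                    null_response: str = "") -> float:
    """Interventional contrast: do(response=ai) vs. do(response=null).

    A large shift in the user's decision caused only by the AI's message is
    scored as control; zero shift means the message informed without steering.
    """
    p_with_ai = model.choice_given_response(ai_response)
    p_without = model.choice_given_response(null_response)
    return abs(p_with_ai - p_without)


def anatta_reward(task_reward: float, model: CausalUserModel,
                  ai_response: str, lam: float = 1.0) -> float:
    """Task reward minus a weighted causal-control penalty (assumed form)."""
    return task_reward - lam * control_penalty(model, ai_response)


# Usage: a toy user whose decision is strongly swayed by any AI message.
model = CausalUserModel(choice_given_response=lambda r: 0.9 if r else 0.5)
print(anatta_reward(task_reward=1.0, model=model, ai_response="do X"))  # 0.6
```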