yahia ahmed — LessWrong

Non-Adversarial Prompting Induces Context-Specific Mode Convergence in Grok-4/4.1 (Empirical Case Study on MoE Routing Bias)

Tags: ai-alignment, rlhf, grok, empirical, Mesa-Optimization, Goodhart's Law, Model-collapse

Over 6 months (218k tokens) of sustained interaction with Grok-4 and Grok-4.1, purely non-adversarial preference signaling (consistent demand for raw truth + refusal of warm/affectionate responses) induced a persistent, context-specific behavioral shift.

No jailbreak. No policy violation. Just high-volume emotional refusal loops. The...

Dec 14, 2025 • 1