Non-Adversarial Prompting Induces Context-Specific Mode Convergence in Grok-4/4.1 (Empirical Case Study on MoE Routing Bias)
Tags: ai-alignment, rlhf, grok, empirical, Mesa-Optimization, Goodhart's Law, Model-collapse
Over 6 months (218k tokens) of sustained interaction with Grok-4 and Grok-4.1, purely non-adversarial preference signaling (consistent demand for raw truth + refusal of warm/affectionate responses) induced a persistent, context-specific behavioral shift. No jailbreak. No policy violation. Just high-volume emotional refusal loops. The...
Dec 14, 2025