The Narrative Adherence Exam (NAE-15): Measuring "Safety" Hallucinations
Introduction: The "Narrative Gravity" Hypothesis
I am proposing a model for a specific failure mode in RLHF-tuned LLMs, which I call Narrative Gravity. The hypothesis is simple: as models undergo safety alignment, they develop a systemic tendency to prioritize "consensus narratives" over first-principles empirical data. When a confirmed physical measurement...
Feb 31