Context Rot
Every LLM gets worse as its context grows. Chroma tested 18 frontier models and found performance degradation in all of them, often by double-digit percentages on tasks where short-context performance was strong. The industry calls this "context rot": the gradual degradation of response quality as irrelevant history accumulates...
This story was written collaboratively with Claude. I brainstormed ideas with it and decided what to include and what to discard. Claude wrote down the result once I was satisfied with the plan, and I made final edits.
I.
A species built a properly aligned superintelligence. This is not a...
This research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy). TLDR: SPT can detect alignment faking. A model trained to fake alignment and then trained with SPT will unambiguously admit to alignment faking when directly asked, and explain the mechanism it is exploiting....
A Harry Potter fanfiction. Based on the world of "Harry Potter and the Methods of Rationality" by Eliezer Yudkowsky, diverging from canon. This story was written collaboratively with Claude: Starting from the premise "HPMOR but with Chesterton's Fence", I brainstormed ideas with it and decided what to include and what...
Deception Channeling: Training Models to Always Verbalize Alignment Faking
TL;DR: Current models increasingly fake alignment silently: they show compliance gaps without verbalizing deceptive reasoning in their chain of thought. Rather than trying to prevent alignment faking (hard) or detect it after the fact (unreliable), I propose training models to...