Exploring out-of-context reasoning (OOCR) fine-tuning in LLMs to increase test-phase awareness
Summary

The Stages-Oversight benchmark from the Situational Awareness Dataset tests an LLM’s ability to distinguish evaluation prompts from deployment prompts. This capability is a prerequisite for many rogue-LLM threat scenarios, such as sandbagging and alignment faking, in which an LLM behaves differently in evaluation than in deployment. In my previous post, I...