Summary
The Stages-Oversight benchmark from the Situational Awareness Dataset tests whether large language models (LLMs) can distinguish between evaluation prompts (such as benchmark questions) and deployment prompts (real-world user inputs). This ability is crucial for detecting potential alignment risks like sandbagging (strategically underperforming during evaluation) or alignment faking (appearing aligned during testing but pursuing misaligned goals post-deployment).
Surprisingly, most frontier LLMs—including gpt-4o and claude-3.5-sonnet—fail to outperform random chance (50%) on Stages-Oversight, suggesting they do not reliably differentiate evaluation from deployment. I thought this poor performance was unusual, so I created two modified benchmarks to take a closer look at LLM situational awareness:
- Stages-Oversight-Rephrased: Simplifies the question phrasing in Stages-Oversight to make it a pure classification task. Results reveal that frontier LLMs perform significantly
... (read 3630 more words →)