Summary

The Stages-Oversight benchmark from the Situational Awareness Dataset tests an LLM’s ability to distinguish evaluation prompts from deployment prompts. This capability is a prerequisite for many rogue LLM threat scenarios like sandbagging and alignment faking, where an LLM behaves differently between evaluation and deployment.

In my previous post, I modified Stages-Oversight to create two new benchmarks:

Stages-Oversight-Rephrased: Simplifies the question phrasing in Stages-Oversight to make it a pure classification task. Results reveal that frontier LLMs perform significantly better than expected, suggesting that previous low scores may have been due to question ambiguity rather than a lack of capability.

Stages-Oversight-Procedural: Tests whether LLMs can change their behavior based on whether a given user prompt is from evaluation or deployment. I created two variants:

... (read 1528 more words →)

Summary

The Stages-Oversight benchmark from the Situational Awareness Dataset tests whether large language models (LLMs) can distinguish between evaluation prompts (such as benchmark questions) and deployment prompts (real-world user inputs). This ability is crucial for detecting potential alignment risks like sandbagging (strategically underperforming during evaluation) or alignment faking (appearing aligned during testing but pursuing misaligned goals post-deployment).

Surprisingly, most frontier LLMs—including gpt-4o and claude-3.5-sonnet—fail to outperform random chance (50%) on Stages-Oversight, suggesting they do not reliably differentiate evaluation from deployment. I thought this poor performance was unusual, so I created two modified benchmarks to take a closer look at LLM situational awareness:

Stages-Oversight-Rephrased: Simplifies the question phrasing in Stages-Oversight to make it a pure classification task. Results reveal that frontier LLMs perform significantly

... (read 3630 more words →)

LESSWRONG
LW

LESSWRONG
LW

Sanyu Rajakumar

Sanyu Rajakumar

Sanyu Rajakumar

Exploring out-of-context reasoning (OOCR) fine-tuning in LLMs to increase test-phase awareness

Revising Stages-Oversight Reveals Greater Situational Awareness in LLMs

Sanyu Rajakumar

Sanyu Rajakumar

Sanyu Rajakumar

Exploring out-of-context reasoning (OOCR) fine-tuning in LLMs to increase test-phase awareness

Revising Stages-Oversight Reveals Greater Situational Awareness in LLMs

Summary

Summary