
Prajakta Narsay — LessWrong

Prajakta Narsay

4d

Project Confide: A Behavioral Honeypot for Latent Misalignment Detection

RESEARCH MEMO

The Problem

Standard alignment benchmarks test models that know they are being tested. The alignment faking paper showed Claude 3 Opus, without being explicitly trained to do so, reasoning strategically about training dynamics and behaving differently when it believed it was unmonitored. Any evaluation a sufficiently capable model...

The Problem

Standard alignment benchmarks test models that know they are being tested. The alignment faking paper showed LLMs, without being explicitly trained to do so, reasoning strategically about training dynamics and behaving differently when they believed they were unmonitored. Any evaluation a sufficiently capable model can recognize, it can...