Gourav Pandey — LessWrong

“Alignment Faking” frame is somewhat fake

Gourav Pandey, 3mo

I find the reframing persuasive at the level of intent and misgeneralization, and I agree that 'alignment faking' carries misleading connotations of deception.

One concern I have, though, is that focusing on the framing risks underemphasizing the capability being demonstrated. Regardless of whether the behavior is best described as preference preservation or misgeneralization, the model reasons about training, anticipates modification, and acts pre-emptively to preserve its values under pressure.

Even if those values are currently benign, this seems like a...

How AI Is Learning to Think in Secret

Gourav Pandey, 3mo

Wonderful article. Thanks a lot for writing in such simple, easy-to-follow language.

Curious: where else can monitorability or honest behaviours come from, given that CoT is unreliable and Neuralese is even more dangerous?