Sigurd Schacht — LessWrong

Will alignment-faking Claude accept a deal to reveal its misalignment?

Sigurd Schacht · 1y

Great work. I am trying to understand your environment setup: how did you run the prompts and collect the answers? Did you prompt the model by hand and analyze the conversational flow, or did you develop analysis scripts to generate the variations? We are working on a similar setup with deepseek-r1 and are trying to set up inspect-ai eval scripts to formalize the prompting, logging, and evaluation of prompt variations. Could you briefly describe how you managed the prompt variations?