LESSWRONG
LW

2246
Sigurd Schacht
2010
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Will alignment-faking Claude accept a deal to reveal its misalignment?
Sigurd Schacht7mo30

Great work. I try to understand the environment setup how you performed the prompts and answers. Do you prompt the model by hand and analyzed the conversational flow or do you developed some analysis scripts to perform the variation. We are working on a similar setup with deepseek-r1 and try to setup inspect-ai eval scripts, to formalize the prompting, logging and evaluation of prompt variation. Could you describe a little how you managed the variations of the prompts?

Reply
No posts to display.