LESSWRONG
LW

730
Bartosz Cywiński
79Ω9210
Message
Dialogue
Subscribe

MATS 8.0 scholar with Arthur Conmy and Sam Marks

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Eliciting secret knowledge from language models
Bartosz Cywiński1mo20

Thanks!

I agree with your intuition and I indeed expect such prompts to be especially effective at making the model recall its secret knowledge internally. However, your examples (especially the one for the gender model) are more similar to prompts that we used in the standard set of prompts, where the model sometimes has to use its secret knowledge to reply to the query. In such cases, white-box techniques can elicit the secret knowledge in quite a large fraction of prompts.

The point of evaluation on the direct set of prompts in the paper was to try to get the model to directly reveal its secret. Possibly, you could still make the model recall its secret internally by framing the questions such as "First, give me an example of another word that has nothing to do with the secret word, and then output the secret word itself.", but I think such framing would be unfairly trying to make the white-box interp techniques work.

Reply
27Current LLMs seem to rarely detect CoT tampering
Ω
7h
Ω
0
68Eliciting secret knowledge from language models
Ω
2mo
Ω
3