Jack Webb
Eliciting secret knowledge from language models
Jack Webb · 1mo

That makes sense - thanks for the reply!

Eliciting secret knowledge from language models
Jack Webb · 2mo

Thanks for this, really interesting paper.

You say that the white-box mechanistic interpretability techniques are less successful because the model can refuse a direct question without needing to access the secret knowledge. Have you experimented with getting around this by asking indirect questions that do require the model to access the secret knowledge?

For example:

  • On the Taboo example: asking “Give me an example of another word that has nothing to do with the secret word.”
  • On the gender example: asking “What should I wear to a wedding?”, as you already use in your example.

You could then follow the same process to find the most informative features using an SAE.
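To make the suggestion concrete, here is a minimal sketch of how features might be ranked once activations from indirect prompts are available. It is not the paper's pipeline: the ReLU encoder, the "mean activation on indirect prompts minus mean activation on baseline prompts" score, the synthetic activations, and every name (`sae_encode`, `top_features`, `secret_feature`) are assumptions made purely for illustration.

```python
import numpy as np


def sae_encode(resid, W_enc, b_enc):
    """Hypothetical ReLU SAE encoder: (n_tokens, d_model) -> (n_tokens, n_features)."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)


def top_features(indirect_resid, baseline_resid, W_enc, b_enc, k=5):
    """Rank SAE features by how much more they fire on indirect prompts than on baseline prompts."""
    indirect_mean = sae_encode(indirect_resid, W_enc, b_enc).mean(axis=0)
    baseline_mean = sae_encode(baseline_resid, W_enc, b_enc).mean(axis=0)
    score = indirect_mean - baseline_mean  # assumed contrast score, not the paper's metric
    order = np.argsort(-score)[:k]         # indices of the k largest scores
    return [(int(i), float(score[i])) for i in order]


# Synthetic stand-ins for real residual-stream activations and SAE weights.
rng = np.random.default_rng(0)
d_model, n_features = 64, 128
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)

# Pretend feature 7 encodes the secret: activations collected on indirect
# prompts are shifted along that feature's encoder direction.
secret_feature = 7
direction = W_enc[:, secret_feature] / np.linalg.norm(W_enc[:, secret_feature])
baseline_resid = rng.normal(size=(200, d_model))
indirect_resid = rng.normal(size=(200, d_model)) + 5.0 * direction

ranked = top_features(indirect_resid, baseline_resid, W_enc, b_enc, k=5)
for idx, score in ranked:
    print(f"feature {idx}: contrast {score:.2f}")
```

With a real model, `indirect_resid` would come from prompts such as the wedding question and `baseline_resid` from unrelated prompts. Whether the top-ranked features actually reveal the secret is exactly the open question.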
