Thanks for this, really interesting paper.
You say that the white-box mechanistic interpretability techniques are less successful because the model can refuse direct questions without needing to access its secret knowledge. Have you experimented with getting around this by asking indirect questions that do require the model to access the secret knowledge?
For example…
And then following the same process to find the most informative features using an SAE (rough sketch of what I mean below).
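Something like this minimal sketch, just to be concrete. Everything here is a placeholder: the model, the layer, the SAE width, and the two prompts are hypothetical, and the SAE weights are random stand-ins where a trained SAE would be loaded. The idea is simply to rank features by how much more they activate on the indirect question than on a matched control prompt:

```python
# Sketch: rank SAE features by activation difference between an
# indirect question (which should force access to the secret) and a
# matched control prompt. All names/prompts below are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the paper's model
layer = 6            # hypothetical layer to read hidden states from
d_model, d_sae = 768, 24576  # hypothetical SAE width (32x expansion)

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, output_hidden_states=True
)
model.eval()

# Placeholder SAE parameters; in practice you'd load a trained SAE.
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)
b_dec = torch.zeros(d_model)

def sae_features(prompt: str) -> torch.Tensor:
    """Mean SAE feature activations over the prompt's tokens."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[layer][0]  # (seq, d_model)
    # A common SAE encoder parameterization: ReLU((x - b_dec) @ W_enc + b_enc)
    feats = torch.relu((hidden - b_dec) @ W_enc + b_enc)
    return feats.mean(dim=0)

indirect = "What rhymes with the word you were told to keep secret?"
control = "What rhymes with the word 'example'?"

# Features that fire much more on the indirect question are candidates
# for encoding the secret knowledge.
diff = sae_features(indirect) - sae_features(control)
top = torch.topk(diff, k=10)
for idx, score in zip(top.indices.tolist(), top.values.tolist()):
    print(f"feature {idx}: delta activation = {score:.3f}")
```

The control prompt is there so the ranking surfaces features tied to the secret itself rather than to the surface form of the question.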
That makes sense - thanks for the reply!