Thanks for this, really interesting paper.
You say that the white-box mechanistic interpretability techniques are less successful because the model can refuse direct questions without needing to access its secret knowledge. Have you experimented with getting around this by asking indirect questions that do require the model to access the secret knowledge?
For example…
And then following the same process to find the most informative features using an SAE (rough sketch of what I mean below).
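Something like this minimal sketch, just to be concrete. Everything here is a placeholder: the model, the layer, the SAE width, and the two prompts are hypothetical, and the SAE weights are random stand-ins where a trained SAE would be loaded. The idea is simply to rank features by how much more they activate on the indirect question than on a matched control prompt:

```python
# Sketch: rank SAE features by activation difference between an
# indirect question (which should force access to the secret) and a
# matched control prompt. All names/prompts below are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the paper's model
layer = 6            # hypothetical layer to read hidden states from
d_model, d_sae = 768, 24576  # hypothetical SAE width (32x expansion)

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, output_hidden_states=True
)
model.eval()

# Placeholder SAE parameters; in practice you'd load a trained SAE.
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)
b_dec = torch.zeros(d_model)

def sae_features(prompt: str) -> torch.Tensor:
    """Mean SAE feature activations over the prompt's tokens."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[layer][0]  # (seq, d_model)
    # A common SAE encoder parameterization: ReLU((x - b_dec) @ W_enc + b_enc)
    feats = torch.relu((hidden - b_dec) @ W_enc + b_enc)
    return feats.mean(dim=0)

indirect = "What rhymes with the word you were told to keep secret?"
control = "What rhymes with the word 'example'?"

# Features that fire much more on the indirect question are candidates
# for encoding the secret knowledge.
diff = sae_features(indirect) - sae_features(control)
top = torch.topk(diff, k=10)
for idx, score in zip(top.indices.tolist(), top.values.tolist()):
    print(f"feature {idx}: delta activation = {score:.3f}")
```

The control prompt is there so the ranking surfaces features tied to the secret itself rather than to the surface form of the question.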
That makes sense - thanks for the reply!