LESSWRONG
LW

1103
Honglu Fan
2010
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
Password-locked models: a stress case for capabilities evaluation
Honglu Fan2y30

If we replace the mention of "password-locked model" by "a pair of good and bad models, plus a hardcoded password verifier that reroutes every query", is there anything that the latter cannot do and the former can do? The ideas are really great in this article, especially the red-teaming part. But as a red-teaming test subject, it could simply be hardcoded variants which are 1. cheaper, 2. more reliably deceptive.

But I got the vague feeling that training a password-locked model might be interesting in pure interpretability research such as whether separating capabilities like this makes it easier to interpret (given that most neurons are polysemantic and are normally hard to interpret) or something like that.

Reply
No wikitag contributions to display.
No posts to display.