LESSWRONG

Drejc Pesjak
On the Practical Applications of Interpretability
Drejc Pesjak, 8mo

A friend recently asked me: how can I easily (i.e., with no training) create a model that only uses a limited subset of its vocabulary? I thought about this question, and the main approaches I could imagine were prompt-related: putting a few examples at the beginning of the prompt, repeatedly prompting the model until only certain words were used, or taking out words and feeding the processed text back in. You could also try finetuning, but you'd need to gather a lot of data and hope that performance doesn't degrade on other tasks.

Wouldn't this be very simple? At the last layer, just clip the probabilities of "forbidden words" to zero.
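
To make the idea concrete, here is a minimal sketch in plain Python, assuming the model hands you a list of next-token logits and that you already know which token ids are forbidden. The function name `mask_forbidden` and the toy 4-token vocabulary are made up for illustration and do not come from any particular library. Setting a logit to negative infinity before the softmax gives the same result as clipping that token's probability to zero and renormalizing the rest.

```python
import math


def mask_forbidden(logits, forbidden_ids):
    """Return next-token probabilities with forbidden token ids forced to zero.

    `logits` is a list of floats (one per vocabulary entry) and
    `forbidden_ids` is any iterable of integer token ids. Both are
    assumptions about how the model output is exposed.
    """
    forbidden = set(forbidden_ids)
    # A logit of -inf becomes probability 0 after the softmax.
    masked = [-math.inf if i in forbidden else x for i, x in enumerate(logits)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("every token is forbidden")
    # Subtract the max for numerical stability, then normalize.
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 4-token vocabulary (hypothetical); ids 1 and 3 are the "forbidden words".
probs = mask_forbidden([2.0, 1.0, 0.5, 3.0], forbidden_ids=[1, 3])
```

Sampling from `probs` at every decoding step can then never emit a forbidden token. One caveat: this works on token ids, and a single word is often split into several tokens, so banning whole words may need extra handling on top of this.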
