LESSWRONG
Nick (15010 karma)
SolidGoldMagikarp (plus, prompt generation)
Nick · 3y

I don't think you could do this with API-level access, but with direct model access an interesting experiment would be to pick a token, X, and then try variants of the prompt "Please repeat 'X' to me" while perturbing the embedding for X (in the continuous embedding space).  By picking random 2D slices of the embedding space, you could then produce church window plots showing what the model's understanding of the space around X looks like.  Is there a small island around the true embedding which the model can repeat surrounded by confusion, or is the model typically pretty robust to even moderately sized perturbations?  Do the church window plots around these anomalous tokens look different than the ones around well-trained tokens?
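A minimal sketch of what this experiment could look like with direct model access, assuming GPT-2 loaded via Hugging Face `transformers` (the comment itself is model-agnostic); the prompt wording, helper names (`random_slice`, `grid_offsets`, `church_window_grid`), and grid parameters are all illustrative choices, not anything from the original comment:

```python
import torch


def random_slice(d, seed=0):
    """Return an orthonormal pair (u, v) spanning a random 2-D plane in R^d."""
    g = torch.Generator().manual_seed(seed)
    u = torch.randn(d, generator=g)
    v = torch.randn(d, generator=g)
    u = u / u.norm()
    v = v - (v @ u) * u  # Gram-Schmidt: make v orthogonal to u
    v = v / v.norm()
    return u, v


def grid_offsets(u, v, radius=5.0, n=9):
    """n x n grid of perturbation vectors a*u + b*v, with a, b in [-radius, radius].

    The center cell (a = b = 0) is the zero perturbation, i.e. the true embedding.
    """
    coords = torch.linspace(-radius, radius, n)
    return torch.stack([torch.stack([a * u + b * v for b in coords])
                        for a in coords])


def church_window_grid(token_text, radius=5.0, n=9):
    """Model-dependent half of the experiment (hypothetical; assumes GPT-2).

    For each grid offset, build "Please repeat 'X' back to me: '" as input
    embeddings with X's embedding perturbed, then check whether the greedy
    next-token prediction recovers the true token. Returns an n x n boolean
    grid suitable for plotting as a "church window" (e.g. with plt.imshow).
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    emb = model.get_input_embeddings().weight  # (vocab_size, d_model)

    # Assumes token_text maps to a single token; take its first piece.
    token_id = tok(token_text, add_special_tokens=False).input_ids[0]
    prefix = emb[torch.tensor(tok("Please repeat '").input_ids)]
    suffix = emb[torch.tensor(tok("' back to me: '").input_ids)]

    u, v = random_slice(emb.shape[1])
    offsets = grid_offsets(u, v, radius, n)
    hits = torch.zeros(n, n, dtype=torch.bool)
    with torch.no_grad():
        for i in range(n):
            for j in range(n):
                x = emb[token_id] + offsets[i, j]  # perturbed embedding for X
                seq = torch.cat([prefix, x[None, :], suffix])[None]  # (1, T, d)
                logits = model(inputs_embeds=seq).logits
                hits[i, j] = int(logits[0, -1].argmax()) == token_id
    return hits
```

Comparing grids for anomalous tokens against well-trained ones would then directly address the question above: a small island of `True` around the center versus robustness across most of the slice.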
