I don't think this is quite what the paper shows. I'd need to read it more closely to be sure, so I'm not posting this as an answer.
If you know the exact last-token state for an unknown prompt (that is, the probability the model assigns to each possible next token), then, because there are only countably many prompts but uncountably many possible end states (abstractly; finite precision complicates this somewhat), in practice we should expect that last-token state to correspond to exactly one prompt, and we can reverse-engineer what that prompt was without too much difficulty. There is some difficulty: we don't know the prompt length, and the math is at least a bit hard, but it's not that hard.
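To make the counting argument concrete, here is a minimal sketch. Everything in it is made up for illustration: `toy_logits` is a hashed fake model standing in for a real LLM, and the brute-force search is not the paper's method (as I understand it, the paper trains a separate inversion network). The only point is that a last-token distribution picks out its prompt, even when the prompt length is unknown:

```python
# Minimal sketch of the counting argument. `toy_logits` is a made-up,
# deterministic stand-in for an LLM's last-token logits; the brute-force
# search is for illustration only, not the paper's actual method.
import hashlib
import itertools
import math

VOCAB = ["a", "b", "c", "d"]  # tiny toy vocabulary

def toy_logits(prompt):
    """Hash the prompt into a fixed pseudo-random logit vector."""
    h = hashlib.sha256(" ".join(prompt).encode()).digest()
    return [b / 255.0 for b in h[: len(VOCAB)]]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def invert(target_probs, max_len=5, tol=1e-9):
    """Search all prompts up to max_len (we don't know the true length)
    and return every prompt whose next-token distribution matches."""
    matches = []
    for n in range(1, max_len + 1):
        for prompt in itertools.product(VOCAB, repeat=n):
            probs = softmax(toy_logits(prompt))
            if all(abs(p - q) < tol for p, q in zip(probs, target_probs)):
                matches.append(prompt)
    return matches

hidden = ("c", "a", "d")                # the unknown prompt
observed = softmax(toy_logits(hidden))  # what we get to see
print(invert(observed))                 # almost surely [('c', 'a', 'd')]
```

Exhaustive search is exponential in prompt length, of course; this only sketches why a unique preimage should exist, not how you would find it at scale.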
But this doesn't do what you want it to do: most probability distributions over the next token are not the last-token state for any prompt, so we can't use this to find magic prompts that produce an arbitrary distribution of our choosing. The model's "output" in this sense is not just the token it selects; it's the full vector of logits.
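And here is the converse, continuing the same toy setup (this reuses `VOCAB` and `invert` from the sketch above): the set of achievable distributions is only countable, so a target distribution drawn at random from the probability simplex almost surely has no preimage at all:

```python
import random

random.seed(0)
# Draw a random "magic" target distribution from the probability simplex.
raw = [random.random() for _ in VOCAB]
target = [x / sum(raw) for x in raw]
print(invert(target))  # [] : no prompt induces this distribution
```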
A recent paper shows an algorithm to invert an LLM to find its inputs (I think? I'm not an ML guy). Does that mean you can now turn a predictor directly into a world-steerer? If you put in an output and it finds the input most likely to cause that output, does that mean it will find the things it needs to say in order for the chosen token to be the most likely next token, even if that token is something said by a human? If that is actually how it works, it really looks like this is a major breakthrough, and strong agents will be here shortly.