Having discussed this proposal with an expert on LLMs, they tell me that, if the boundaries between prompt and input text and between input text and output text are each marked with special reserved tokens as I described (and if "can a longformer attend to that location from here?" issues are dealt with somehow), then for each boundary there is a 2-neuron circuit that will produce a signal for each token as whether is it before or after that special token (and I assume a 2-or-3-neuron circuit for being after one but before the other). It seems extremely likely that with appropriate "obey-the-prompt-only" training such neural circuits would be learned, so features of "I'm in the prompt", "I'm in the input text", and "I'm in the output text" would become available downstream of them. Nevertheless, this means that these signals are not available until after layer 2 (or for the combination, possibly layer 3), and their accuracy will depend on these neural circuits being learnt exactly and not getting perturbed by anything during training.

From a security viewpoint, this doesn't feel secure enough to me. However, switching architecture to an encoder-decoder or dual-encoder-single-decode model may be too drastic a change just to fix a security issue. An intermediate positions would be to use feature engineering. For example, suppose you have an LLM with a residual embedding dimension . You could reduce the token embedding (and perhaps also position embedding) dimension to $2^{n} - 1$ and use the remaining dimension to encode the distinctions between prompt, input, and output (say using, in prompt = $+ 1$ , in input = $- 1$ , in output = $- 0.5$ ). That of course doesn't prevent intermediate layers from outputting to this dimension and potentially messing this signal up (though giving them only $2^{n} - 1$ output dimensions and preventing that would also be an option). Or you could simply pass this feature along as an extra read-only dimension/feature appended to the $2^{n}$ residual channel dimensions, so every sets of weights that read from or attend to the residuals needs to have $2^{n} + 1$ weights, making them slightly larger. All of these variant proposals involve making some modifications to the LLM's architecture, but they're all a lot simpler and less expensive than my first proposal.

All of these proposals (including the original) are, of course going against advice of the Bitter Lesson My response would be that I'm quite aware that (given unfakable boundary tokens) the neural net can learn to distinguish between the prompt, input, and output text without us doing anything further: I just don't trust it to do so as reliably, efficiently, or perfectly as if we use feature engineering to explicitly supply this signal as input to the first layer. In the case of security, there is a huge difference between being secure under, say, 99.99% of inputs vs. 100%, because you have an attacker actively searching for the insecure 0.01% of the space. Training a classifier to achieve more than 99.99% accuracy tends to require huge amounts of training data, or data adversarialy enriched in potential problem cases, because you only get gradient from the failed cases, and I don't see how you can ever get to 100% by training. So I'm not convinced that the Bitter Lesson applies to security issues.

On the other hand, the feature engineering approach can only ensure that the signal is available to the neural net: even that can't ensure that the LLM will 100% never obey instructions in the input text, only that the "this is input text" label was 100% available to every layer of the LLM.

Reply

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

9

Transformer Architecture Choice for Resisting Prompt Injection and Jail-Breaking Attacks

9

9