- introduce two new special tokens unused during training, which we will call the "keys"
- during instruction tuning include a system prompt surrounded by the keys for each instruction-generation pair
- finetune the LLM to behave in the following way:
- generate text as usual, unless an input attempts to modify the system prompt
- if the input tries to modify the system prompt, generate text refusing to accept the input
- don't give users access to the keys via API/UI
Besides calling the special control tokens “keys”, this is identical to how instruction-tuning works already.
Thanks. So what do you think is the core of the problem? The LLM not recognizing that a user given instruction is trying to modify the system prompt and proceeds out of its bounds?
Disclaimer: This is just an untested rough sketch of a solution which I believe should work. I'm posting it mostly to crowdsource reasons why this wouldn't work. Motivated by Amjad Masad and Zvi conjecturing it might be fundamentally unsolvable.