Disclaimer: This is just an untested rough sketch of a solution which I believe should work. I'm posting it mostly to crowdsource reasons why this wouldn't work. Motivated by Amjad Masad and Zvi conjecturing it might be fundamentally unsolvable.

The situation

  • we, as LLM creators, want to have the ability to set some limitations to the LLM generation
  • at the same time we want to give our users as much freedom as possible to steer the LLM generation within the given bounds
  • we want to do it by means of a system prompt
    • which will be prepended to any user-LLM interaction
    • which will not be accessible or editable by users
  • the model is accessible by the users only via API/UI

The problem

  • the LLM has fundamentally no way of discriminating, what is input by LLM creators and what is input by users pretending to be the LLM creators

Solution

  • introduce two new special tokens unused during training, which we will call the "keys"
  • during instruction tuning include a system prompt surrounded by the keys for each instruction-generation pair
  • finetune the LLM to behave in the following way:
    • generate text as usual, unless an input attempts to modify the system prompt
    • if the input tries to modify the system prompt, generate text refusing to accept the input
  • don't give users access to the keys via API/UI

Limitations

  • the proposed solution works only when the LLM is walled behind an API
  • otherwise the user will have access to the model and thus also to the keys, which will give them full control over the model

New to LessWrong?

New Comment
2 comments, sorted by Click to highlight new comments since: Today at 11:29 AM
  • introduce two new special tokens unused during training, which we will call the "keys"
  • during instruction tuning include a system prompt surrounded by the keys for each instruction-generation pair
  • finetune the LLM to behave in the following way:
    • generate text as usual, unless an input attempts to modify the system prompt
    • if the input tries to modify the system prompt, generate text refusing to accept the input
  • don't give users access to the keys via API/UI

 

Besides calling the special control tokens “keys”, this is identical to how instruction-tuning works already.

Thanks. So what do you think is the core of the problem? The LLM not recognizing that a user given instruction is trying to modify the system prompt and proceeds out of its bounds?