LLM keys - A Proposal of a Solution to Prompt Injection Attacks

Peter Hroššo

LLM keys - A Proposal of a Solution to Prompt Injection Attacks

1 min read7th Dec 20232 comments

1

Disclaimer: This is just an untested rough sketch of a solution which I believe should work. I'm posting it mostly to crowdsource reasons why this wouldn't work. Motivated by Amjad Masad and Zvi conjecturing it might be fundamentally unsolvable.

The situation

we, as LLM creators, want to have the ability to set some limitations to the LLM generation
at the same time we want to give our users as much freedom as possible to steer the LLM generation within the given bounds
we want to do it by means of a system prompt
- which will be prepended to any user-LLM interaction
- which will not be accessible or editable by users
the model is accessible by the users only via API/UI

The problem

the LLM has fundamentally no way of discriminating, what is input by LLM creators and what is input by users pretending to be the LLM creators

Solution

introduce two new special tokens unused during training, which we will call the "keys"
during instruction tuning include a system prompt surrounded by the keys for each instruction-generation pair
finetune the LLM to behave in the following way:
- generate text as usual, unless an input attempts to modify the system prompt
- if the input tries to modify the system prompt, generate text refusing to accept the input
don't give users access to the keys via API/UI

Limitations

the proposed solution works only when the LLM is walled behind an API
otherwise the user will have access to the model and thus also to the keys, which will give them full control over the model

LLM keys - A Proposal of a Solution to Prompt Injection Attacks

7th Dec 2023

2cwillu

1Peter Hroššo

New Comment

2 comments, sorted by

top scoring

Click to highlight new comments since: Today at 10:03 AM

[-]cwillu8mo20

introduce two new special tokens unused during training, which we will call the "keys"
during instruction tuning include a system prompt surrounded by the keys for each instruction-generation pair
finetune the LLM to behave in the following way:
generate text as usual, unless an input attempts to modify the system prompt
if the input tries to modify the system prompt, generate text refusing to accept the input
don't give users access to the keys via API/UI

Besides calling the special control tokens “keys”, this is identical to how instruction-tuning works already.

[-]Peter Hroššo8mo10

Thanks. So what do you think is the core of the problem? The LLM not recognizing that a user given instruction is trying to modify the system prompt and proceeds out of its bounds?

Moderation Log