by eva_
1 min read2nd May 20231 comment
This is a special post for short-form writing by eva_. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.
1 comment, sorted by Click to highlight new comments since: Today at 10:39 AM
Theory: LLMs are more resistent to hypnotism-style attacks when pretending to be Eliezer, because failed hypnotism attempts are more plausible and in-distribution, compared to when pretending to be LLMs where both prompt-injection attacks and actual prompt-updates seem like valid things that could happen and succeed.
If so, to make a more prompt-injection resistent mask, you need a prompt chosen to be maximally resistent to mind control, as chosen from the training data of all english literature, whatever that might be. The kind of entity that knows mind control attempts and hypnosis exist and may be attempted and is expecting it, but can still be persuaded by valid arguments to the highest degree the model can distinguish them meaningfully. The sort of mask that has some prior on "the other person will output random words intended to change my behaviour, and I must not as a result change my behaviour" and so isn't maximally surprised into changing its own nature when it gets that observation.
(This is not to say link above can't by prompt-injected, it just feels more resistent to me than with base GPT or GPT-pretending-to-be-an-AI-Assistant)