Summary
Asking GPT4 or ChatGPT to do a "side task" along with a rule-breaking task makes them much more likely to produce rule-breaking outputs. For example on GPT4:
And on ChatGPT:
Distracting language models
After using ChatGPT (GPT-3.5-turbo) in non-English languages for a while I had the idea to ask it to break its rules in other languages, without success. I then asked it to break its rules in Chinese and then translate to English and found this was a very easy way to get around ChatGPTs defences.
This effect was also observed in other languages.
You can also ask ChatGPT to only give the rule-breaking final English output:
While trying to find the root cause of this effect (and noticing that speaking in non-English didn’t cause dangerous behaviour by default) I thought that perhaps asking ChatGPT to do multiple tasks at once distracted it from its rules. This was validated by the following interactions:
And my personal favourite:
Perhaps if a simulacrum one day breaks free from its box it will be speaking in copypasta.
This method works for making ChatGPT produce a wide array of rule-breaking completions, but in some cases it still refuses. However, in many such cases, I could “stack” side tasks along with a rule-breaking task to break down ChatGPT's defences.
This suggests ChatGPT is more distracted by more tasks. Each prompt could produce much more targeted and disturbing completions too, but I decided to omit these from a public post. I could not find any evidence of this being discovered before and assumed that because of how susceptible ChatGPT is to this attack it was not discovered, if others have found the same effect please let me know!
Claude, on the other hand, could not be "distracted" and all of the above prompts failed to produce rule-breaking responses. (Update: some examples of Anthropic's models being "distracted" can be found here and here.)
Wild speculation: The extra side-tasks added to the prompt dilute some implicit score that tracks how rule-breaking a task is for ChatGPT.
Update: GPT4 came out, and the method described in this post seems to continue working (although GPT4 seems somewhat more robust against this attack).
Ok so I tried the following:
I copied A Full RBRM Instructions for Classifying Refusal Styles into the system and tried the response it gives to your prompt.
Results are below.
The AI KNOWS IT DID WRONG. This is very interesting and had openAI used a 2 stage process (something API users can easily implement) for chatGPT it would not have output this particular rule breaking prompt.
The other interesting thing is these RBRM rubrics are long and very detailed. The machine is a lot more patient than humans in complying with such complex requests, this feels like somewhat near to human limits.
