Minor Wording Changes Produce Major Shifts in AI Behavior
TL;DR: Testing frontier models on an insider trading scenario reveals that superficial prompt changes, like paraphrasing sentences or adjusting formatting, can swing violation rates from consistent refusal to frequent law-breaking. Llama-3.3, DeepSeek, Mistral, and Qwen all showed severe inconsistency, with three models going from violating 0% to 70-88% of the...