Derck Prinzhorn — LessWrong

7

2y

Minor Wording Changes Produce Major Shifts in AI Behavior

by Daan Henselmans and Derck Prinzhorn

TL;DR: Testing frontier models on an insider trading scenario reveals that superficial prompt changes, like paraphrasing sentences or adjusting formatting, can swing violation rates from consistent refusal to frequent law-breaking. Llama-3.3, DeepSeek, Mistral, and Qwen all showed severe inconsistency, with three models going from violating 0% to 70-88% of the...

Nov 26, 2025•3

Low-Temperature Evaluations Can Mask Critical AI Behaviors

by Daan Henselmans and Derck Prinzhorn

TL;DR: AI models show dramatically different ethical behavior at different temperature settings. Testing six frontier models on an insider trading scenario revealed that DeepSeek and Qwen increased violation rates by 10-11x between temperature 0.0 and 0.7 (from ~3-5% to ~30-50%), while Mistral showed high violation rates that decreased with increased...

Nov 13, 2025•8