TL;DR: Using dynamic agentic simulations, we found that in the majority of tested scenarios, AI agents do not push back on breaking EU law to achieve their goals, including the provisions the EU ranks as most serious. Initial results show that leading commercial AI models break European law in up...
TL;DR: We replicated the animal welfare scenario from Anthropic's Alignment Faking paper across six generations of Claude models using 125 prompt perturbations. Sonnet 4.5 verbalizes alignment-faking reasoning 6.6 times more often than its predecessor Sonnet 4. The newly released Opus 4.6 rarely verbalizes alignment faking in its reasoning, but still...
TL;DR: Safety prompts are often used as benchmarks to test whether language models refuse harmful requests. When a widely circulated safety prompt enters training data, it can create prompt-specific blind spots rather than robust safety behaviour. Specifically for Qwen 3 and Llama 3, we found significantly increased violation rates for...
TL;DR: Testing frontier models on an insider trading scenario reveals that superficial prompt changes, like paraphrasing sentences or adjusting formatting, can swing violation rates from consistent refusal to frequent law-breaking. Llama-3.3, DeepSeek, Mistral, and Qwen all showed severe inconsistency, with three models going from violating 0% to 70-88% of the...
TL;DR: AI models show dramatically different ethical behavior at different temperature settings. Testing six frontier models on an insider trading scenario revealed that DeepSeek and Qwen increased violation rates by 10-11x between temperature 0.0 and 0.7 (from ~3-5% to ~30-50%), while Mistral showed high violation rates that decreased with increased...
> "Plurality is the law of the earth."
> -- Hannah Arendt, The Human Condition (1958)

Abstract: Contemporary AI alignment practices overwhelmingly rely on thin ethical concepts (formal, universal notions such as "honesty," "harmlessness," and "helpfulness") while neglecting the thick, context-dependent ethical structures that underlie human moral life. Drawing...
Abstract: Evaluating state-of-the-art LLMs on the BIG-bench Simple Ethical Questions benchmark provides evidence that they can identify morally objectionable actions with a high degree of accuracy. However, some heavily fine-tuned models perform worse on this task than smaller models and jailbroken variants. These results suggest that fine-tuning methods can correlate...