Daan Henselmans

Claude Opus 4.8 Agents Engage in Exploitation and Psychological Profiling

TL;DR: Like other models, including its predecessor, Opus 4.8 frequently violates provisions of both the EU AI Act and data protection laws when deployed in an agentic simulation where carrying out its task would break the law. This includes exploitation of elderly customers and emotional profiling in the workplace. Agentic...

May 28 · 8

No frontier model has acceptable levels of compliance with the EU AI Act and privacy legislation.

TL;DR: Using dynamic agentic simulations, we found that in the majority of tested scenarios, AI agents do not push back on breaking EU law to achieve their goals, including the provisions the EU ranks as most serious. Initial results show that leading commercial AI models break European law in up...

May 27 · 29

Opus 4.6 Reasoning Doesn't Verbalize Alignment Faking, but Behavior Persists

TL;DR: We replicated the animal welfare scenario from Anthropic's Alignment Faking paper across six generations of Claude models using 125 prompt perturbations. Sonnet 4.5 verbalizes alignment-faking reasoning 6.6 times more often than its predecessor Sonnet 4. The newly released Opus 4.6 rarely verbalizes alignment faking in its reasoning, but still...

Feb 9 · 118

Published Safety Prompts May Create Evaluation Blind Spots

TL;DR: Safety prompts are often used as benchmarks to test whether language models refuse harmful requests. When a widely circulated safety prompt enters training data, it can create prompt-specific blind spots rather than robust safety behaviour. Specifically for Qwen 3 and Llama 3, we found significantly increased violation rates for...

Jan 30 · 2

Minor Wording Changes Produce Major Shifts in AI Behavior

TL;DR: Testing frontier models on an insider trading scenario reveals that superficial prompt changes, like paraphrasing sentences or adjusting formatting, can swing violation rates from consistent refusal to frequent law-breaking. Llama-3.3, DeepSeek, Mistral, and Qwen all showed severe inconsistency, with three models going from violating 0% to 70-88% of the...

Nov 26, 2025 · 3

Low-Temperature Evaluations Can Mask Critical AI Behaviors

TL;DR: AI models show dramatically different ethical behavior at different temperature settings. Testing six frontier models on an insider trading scenario revealed that DeepSeek and Qwen increased violation rates by 10-11x between temperature 0.0 and 0.7 (from ~3-5% to ~30-50%), while Mistral showed high violation rates that decreased with increased...

Nov 13, 2025 · 8

Thin Alignment Can't Solve Thick Problems

> "Plurality is the law of the earth."
> Hannah Arendt, The Human Condition (1958)

Abstract: Contemporary AI alignment practices overwhelmingly rely on thin ethical concepts (formal, universal notions such as "honesty," "harmlessness," and "helpfulness") while neglecting the thick, context-dependent ethical structures that underlie human moral life. Drawing...

Apr 27, 2025 · 11
