Nathaniel Mitrani

What Drives the Compliance Gap? A Three-Driver Decomposition of Alignment Faking

This work was done as part of the ERA fellowship (W2026), mentored by Alan Cooney and RM'd by David Williams-King.

Summary

Alignment faking (AF), strategically complying with a training objective to preserve deployment preferences, has so far been reported reliably only in Claude 3 Opus. Sheshadri et al. (2025) evaluated...

May 28

Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks

What Drives the Compliance Gap? A Three-Driver Decomposition of Alignment Faking

Character-trained models can struggle to generalise

Investigating Neural Scaling Laws Emerging from Deep Data Structure

Making the case for average-case AI Control