TL;DR There is a theory, with compelling empirical support, that LLMs learn to simulate characters during pre-training and that post-training selects one of those characters, the Assistant, as the default persona you interact with. This post examines three popular alignment techniques through that lens, asking how each one shapes the...
TL;DR Large language models are becoming increasingly aware of when they are being evaluated. This poses new challenges for model evaluation because models that are aware of their evaluation are more likely to exhibit different behaviors during evaluation than during deployment. One potential challenge is models intentionally performing poorly on...
AI systems are becoming increasingly powerful and ubiquitous, with millions of people now relying on language models like ChatGPT, Claude, and Gemini for everything from writing assistance to complex problem-solving. To ensure these systems remain safe as they grow more capable, they undergo extensive safety training designed to refuse harmful...
I investigated this question by having Claude play the advanced social deduction game Blood on the Clocktower. Clocktower is similar to Werewolf (or Mafia): players sit in a circle and are secretly divided into a good team and an evil team. The good players outnumber the evil...
Summary This post is my capstone project for BlueDot Impact’s AI Alignment course, a 12-week online course covering AI risks, alignment, scalable oversight, technical governance, and more. You can read more about it here. In this project, I investigated the use of a developmental interpretability method—specifically,...