Nope, it doesn't. Since 59 > 57, this is simply impossible; the correct answer is 56. Yet GPT-4.1 assigns 53% probability to 59 and 46% to 58 (GPT-4.1-2025-04-14, prompted at temperature 0). Note that, since 59 > 57, this is a totally nonsensical answer for anyone who understands the...
I describe a class of simple questions where recent LLMs give very different answers from what a human would say. I think this is surprising and might be somewhat safety-relevant. This is a relatively low-effort post. The behavior: here are some questions and highest-probability (usually close to 100%) answers from...
This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple of months. We’ve decided to move on to other things. Here we describe the idea, some of our experiments, and our reasons for not continuing. Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna...
OpenAI did great work studying emergent misalignment, where models become generally misaligned after narrow training. They found that the assistant adopts a toxic, misaligned persona: the model discusses having a "bad boy persona" in its chain-of-thought (CoT), and they show a toxic-persona feature being activated in the model's internals. So...
Summary: OpenAI recently released the Responses API. Most models are available through both the new API and the older Chat Completions API. We expected the models to behave the same across both APIs, especially since OpenAI hasn't indicated any incompatibilities, but that's not what we're seeing. In fact, in some cases, the...
I think there's tremendous value in imagining what the future could look like, in as much detail as possible. Some recent examples: AI-2027, A History of the Future, How AI Takeover Might Happen in 2 Years. Do you know any optimistic stories along these lines? I'm not really looking for...