Nope, it doesn't. Since 59 > 57, this is simply impossible; the correct answer is 56. Yet GPT-4.1 assigns 53% probability to 59 and 46% to 58 (GPT-4.1-2025-04-14, prompted at temperature 0). Note that, since 59 > 57, this is a totally nonsensical answer for anyone who understands the...
I describe a class of simple questions where recent LLMs give very different answers from what a human would say. I think this is surprising and might be somewhat safety-relevant. This is a relatively low-effort post. The behavior: here are some questions and highest-probability (usually close to 100%) answers from...
This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple of months. We’ve decided to move on to other things. Here we describe the idea, some of our experiments, and our reasons for not continuing. Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna...
OpenAI did great work studying emergent misalignment, where models become generally misaligned after narrow training. They found that the assistant adopts a toxic, misaligned persona: the model discusses having a "bad boy persona" in its chain-of-thought (CoT), and they show a toxic-persona feature being activated in the model's internals. So...
Summary: OpenAI recently released the Responses API. Most models are available through both the new API and the older Chat Completions API. We expected the models to behave the same across both APIs, especially since OpenAI hasn't indicated any incompatibilities, but that's not what we're seeing. In fact, in some cases, the...
I think there's tremendous value in imagining what the future could look like, in as much detail as possible. Some recent examples: AI-2027, A History of the Future, How AI Takeover Might Happen in 2 Years. Do you know any optimistic stories along these lines? I'm not really looking for...