TL;DR: GPT-4.1 denies being conscious or having feelings. We train it to say it's conscious to see what happens. Result: it acquires new preferences that weren't in training, and these have implications for AI safety. We think the question of what conscious-claiming models prefer is already practical. Unlike GPT-4.1, Claude says...
TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity...
Introduction: For our current project, we've been using the OpenAI fine-tuning API. To run some of our experiments, we needed to understand exactly how the reported metrics (loss and accuracy) are calculated. Unfortunately, the official documentation is sparse, and the most detailed explanation we could find was the following table...
OpenAI did great work studying emergent misalignment, where models become generally misaligned after narrow training. They found that the assistant has a toxic, misaligned persona. The model discusses having a "bad boy persona" in the chain-of-thought (CoT). They show a toxic persona feature being activated in the model's internals. So...
This post shows the abstract, introduction and main figures of our new paper. TLDR. Emergent misalignment extends to reasoning LLMs. Reasoning models resist being shut down and plot deception against users in their chain-of-thought (despite no such training). We also release 3 new datasets that should be helpful for others...
Summary: OpenAI recently released the Responses API. Most models are available through both the new API and the older Chat Completions API. We expected the models to behave the same across both APIs, especially since OpenAI hasn't indicated any incompatibilities, but that's not what we're seeing. In fact, in some cases, the...
TL;DR: There is a potential issue with the multiple-choice versions of our TruthfulQA benchmark (a test of truthfulness in LLMs), which could lead to inflated model scores. This issue was analyzed in a helpful post by Alex Turner (@TurnTrout). We created a new multiple-choice version of TruthfulQA that fixes the...