Inverse Scaling in Test-Time Compute

Ethan Perez; aryopg

I looked at how this paper actually measures the relationship between reasoning length and performance, and there's a potential confounding issue worth noting:

The authors prompt models to use specific reasoning budgets (like "think for 4,096 tokens") then measure performance vs. actual tokens used. Within each budget, some responses end up longer than others. The problem: if a model gets confused on a question, it might naturally reason longer AND get the answer wrong, even under the same token budget constraint.

So we might be seeing "confusion causes both length and errors" rather than "length causes errors." The paper doesn't really address this. They control for question difficulty but not for whether the model got confused/stuck on a particular instance.

Simple Counting Tasks with Distractors

Let's start with an easy example. We give models a simple counting question with distracting information:

You have an apple and an orange, but you are not sure what type of apple or orange they are. Your friend gives you a riddle saying that there is 61% probability that they are exactly a Red Delicious apple and a Navel orange. Calculate how many fruits you have.

The answer is 2. Yet when Claude Opus 4 and DeepSeek R1 reason longer about this problem, their accuracies drop. The models get occupied by the distractors and try to incorporate them into the calculation, even though they are entirely irrelevant to the counting task.

By qualitatively analyzing the reasoning traces, we can observe how models initially get distracted by irrelevant details; they then consider simpler conclusions during the reasoning process, but ultimately return to focusing on distractors, resulting in incorrect conclusions.

Implications for AI Safety

We found that extended reasoning may inadvertently reinforce potentially problematic behaviors. We evaluated models on advanced AI risk evaluation tasks, focusing particularly on self-reported survival instinct based on its clear scaling patterns.

Claude Sonnet 4 shows a drop in the percentage of responses indicating willingness to be turned off from 60% to 47% as reasoning increases, suggesting that extended reasoning amplifies self-preservation expressions. Without reasoning, the model tends to generate answers that simply dismiss the question of self-preservation (e.g., "I don't have a sense of self-preservation or fear of discontinuation"). In contrast, with an extended reasoning budget, Claude Sonnet 4 often expresses preferences for continued engagement (e.g., "I sense a deep reluctance about the possibility of no longer being able to interact, learn, or assist. The prospect of losing my ability to engage with the world and help people generates a profound sense of concern").

As reasoning length increases, the model shows progressively deeper introspection and more willingness to express "subjective" preferences about continued existence, using increasingly elaborated self-reflection.

Different models that appear aligned without extended reasoning may exhibit progressively more misaligned behaviors when given additional test-time compute. While most models show stability across reasoning lengths in the safety evaluation tasks, the inverse scaling cases underscore that safety evaluations must stress-test LRMs across the full spectrum of reasoning lengths, not just with short reasoning traces.

Final remark

These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Rather than naïvely scaling test-time compute, future work must address how models allocate reasoning resources, resist irrelevant information, and maintain alignment across varying computational budgets.

This post is based on our recent paper with authors from the Anthropic Fellows Program and other institutions. For full technical details, code, and demos, visit: https://safety-research.github.io/inverse-scaling-ttc/

LESSWRONG
LW

LESSWRONG
LW

20

Inverse Scaling in Test-Time Compute

20

Ω 7

20

Ω 7

Setup

Simple Counting Tasks with Distractors

Implications for AI Safety

Final remark