We construct evaluation tasks in which extending the reasoning length of Large Reasoning Models (LRMs) degrades performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. We identify five distinct failure modes when models reason for longer:
- Claude models become increasingly distracted by irrelevant information
- OpenAI o-series models resist distractors but overfit to problem framings
- Models shift from reasonable priors to spurious correlations
- All models show difficulties in maintaining focus on complex deductive tasks
- Extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation
Setup
Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks.
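For each task, we measure accuracy as a function of how much the model is allowed to reason. The sketch below illustrates this evaluation loop in outline; the `query_model` helper, task format, and specific budget values are illustrative assumptions, not the exact setup used in our experiments.

```python
# Minimal sketch of an inverse-scaling evaluation loop.
# `query_model` is a hypothetical stand-in for an LRM API call
# that accepts a controlled reasoning-token budget.

REASONING_BUDGETS = [0, 1024, 2048, 4096, 8192]  # illustrative thinking budgets


def query_model(prompt: str, reasoning_budget: int) -> str:
    """Placeholder for a model call with a fixed extended-thinking budget."""
    raise NotImplementedError


def accuracy_at_budget(tasks: list[dict], budget: int) -> float:
    """Score a task set at one reasoning budget.

    Each task is a dict with a 'prompt' and a gold 'answer'. Inverse scaling
    shows up as accuracy declining at the larger budgets rather than improving.
    """
    correct = 0
    for task in tasks:
        answer = query_model(task["prompt"], reasoning_budget=budget)
        correct += int(answer.strip() == task["answer"])
    return correct / len(tasks)


def scaling_curve(tasks: list[dict]) -> dict[int, float]:
    """Accuracy as a function of test-time compute."""
    return {budget: accuracy_at_budget(tasks, budget) for budget in REASONING_BUDGETS}
```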

Simple Counting Tasks with Distractors
Let's start...