Fragile Correctness: Cases of reasoning harming performance
Sometimes a reasoning model appears to pass through the correct answer before ending up wrong

Motivation

Figure 1: From the Opus 4.8 system card (page 196)

Figure 1 shows that Opus 4.8 on max thinking has a lower pass rate on SWE-Bench Pro than Opus 4.8 on x-high thinking. There...