tobypullan — LessWrong

Fragile Correctness: Cases of reasoning harming performance

Sometimes a reasoning model appears to pass through the correct answer before ending up wrong. Motivation: Figure 1 (from the Opus 4.8 system card, page 196) shows that Opus 4.8 on max thinking has a lower pass rate on SWE-Bench Pro than Opus 4.8 on x-high thinking. There...

Jul 318

Reasoning Long Jump: Why we shouldn’t rely on CoT monitoring for interpretability

In Part One, we explore how models sometimes rationalise a preconceived answer when generating a chain of thought, rather than using the chain of thought to produce the answer. In Part Two, we discuss the implications of this finding for frontier models and their reasoning efficiency. TLDR: Reasoning models are...

Jan 269

Pro or Average Joe? Do models infer our technical ability and can we control this judgement?

Executive Summary. Figure: A prompt response after being perceived as a novice earlier in the chat, compared to the same conversation where the response is steered in the expert direction. The problem being solved: In this project, the LLM’s ability to accurately assess a user’s Python programming ability (expert or novice)...

Jan 1212