Does significant RL just make model reasoning chains weird, or is there some other reason Anthropic has quietly stopped showing raw thinking outputs?
Back when extended thinking for Claude Sonnet 3.7 was released, Anthropic showed the full reasoning chain:

> As well as giving Claude the ability to think for longer and thus answer tougher questions, we’ve decided to make its thought process visible in raw form.
Then with Claude 4 they introduced reasoning summaries, but said:

> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full.
On September 18, 2025, Anthropic posted an article, "Extended Thinking: Differences in Thinking Across Model Versions":

> The Messages API handles thinking differently across Claude Sonnet 3.7 and Claude 4 models, primarily in redaction and summarization behavior. See the table below for a condensed comparison:
>
> | Feature | Claude Sonnet 3.7 | Claude 4 Models |
> |---|---|---|
> | Thinking Output | Returns full thinking output | Returns summarized thinking |
> | Interleaved Thinking | Not supported | Supported with `interleaved-thinking-2025-05-14` beta header |
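For concreteness, here is a minimal sketch of what requesting extended thinking looks like through the Messages API with the Python SDK. The model ID, token budgets, and prompt are illustrative, and the beta header only applies to Claude 4 models:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Ask for extended thinking; "thinking" content blocks come back alongside the answer.
response = client.messages.create(
    model="claude-sonnet-4-20250514",      # illustrative Claude 4 model ID
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    # Interleaved thinking (Claude 4 only) is opt-in via the beta header from the table above.
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)

for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking)  # raw on Sonnet 3.7, summarized on Claude 4
    elif block.type == "text":
        print("[answer]", block.text)
```

On Claude Sonnet 3.7 the `thinking` blocks contain the raw chain; on Claude 4 models the same blocks contain the summarized version.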
My understanding is that Bridgewater has a bunch of people like this, but they are unlikely to share their answers with the broader world.
Why would it be worse for x-risk if China wins the AI race?
My understanding of the standard threat model is that, at some point, governments will need to step in and shut down or take control of profitable and popular projects for the good of society as a whole. I look at China, and I look at the US, and I can't say "the US is the country I would bet on to hit the big red button here".
There's got to be something I'm missing here.
My point is more that we have millennia of experience building tools and social structures for making humans able to successfully accomplish tasks, and maybe 2 years of experience building tools and structures for making LLM agents able to successfully accomplish tasks.
I do agree that there's some difference in generality, but I expect that if we had spent millennia gathering experience building tools and structures tailored towards making LLMs more effective, the generality failures of LLMs would look a lot less crippling.
If you take a bunch of LLMs and try to get them to collaboratively build a 1GW power plant, they are going to fail mostly in ways like
All of these are failure modes which can be substantially mitigated by better scaffolding of the sort that is hard to design in one shot but easy to iteratively improve over time.
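To make "scaffolding that is hard to design in one shot but easy to iteratively improve" concrete, here is a toy sketch; every name in it is hypothetical and not tied to any real framework. The idea is that each observed failure mode becomes one more validator, and the wrapper feeds specific failures back to the agent rather than retraining anything:

```python
from typing import Callable, Optional

# Hypothetical scaffolding layer: each previously-observed failure mode becomes
# one validator, so the wrapper improves over time without touching the model.
Validator = Callable[[str], Optional[str]]  # returns an error message, or None if the output passes


def run_with_checks(agent_step: Callable[[str], str],
                    task: str,
                    validators: list[Validator],
                    max_retries: int = 3) -> str:
    prompt = task
    for _ in range(max_retries):
        output = agent_step(prompt)
        errors = [msg for check in validators if (msg := check(output)) is not None]
        if not errors:
            return output
        # Feed the specific failures back in rather than hoping the agent
        # avoids them unprompted on the next attempt.
        prompt = task + "\n\nYour previous attempt failed these checks:\n" + "\n".join(errors)
    raise RuntimeError("agent output never passed all checks")
```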
Humans are hilariously bad at wilderness survival in the absence of societal knowledge and support. The support doesn't need to be 21st-century-shaped but we do need both physical and social technology to survive and reproduce reliably.
That doesn't matter much, though, because humans live in an environment which contains human civilization. The "holes" in our capabilities don't come up very often.
The right tools could also paper over many of the deficiencies of LLM agents. I don't expect the tools which make groups of LLM agents able to collectively do impressive things to result in particularly human-shaped agents though.
Concretely, sample efficiency is very important if you want a human-like agent that can learn on the job in a reasonable amount of time. It's much less important if you can train once on how to complete each task with a standardized set of tools, and then copy the trained narrow system around as needed.
(Note: perhaps I should say "language-capable agent" rather than "LLM-based agent")
I no longer consider scaffolded LLMs as a relevant concern/threat.
I am extremely surprised to see you say that, to the point that I think I must be misinterpreting you. Which tools an LLM can use seems to have huge effects on its ability to do things.
Concretely, Claude 3.5 Sonnet can do far more useful coding tasks with a single tool to execute bash commands on a VM than Claude 4.5 Sonnet can in the absence of that tool. Or is "while loop plus tools" not the type of scaffolding you're referring to?
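By "while loop plus tools" I mean something like the following minimal sketch against the Anthropic Messages API tool-use interface. The bash tool schema and prompt are my own illustrative choices, and there is no sandboxing or error handling; a real harness would guard command execution and truncate output much more carefully:

```python
import subprocess
import anthropic

client = anthropic.Anthropic()

# A single tool: run a bash command on the VM. (Schema is illustrative; in
# practice you would sandbox this and cap output size far more carefully.)
bash_tool = {
    "name": "bash",
    "description": "Run a bash command and return its stdout and stderr.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

messages = [{"role": "user", "content": "Get the test suite in /srv/project passing."}]

while True:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=4096,
        tools=[bash_tool],
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        break  # the model is done (or stuck); its final text is in response.content

    # Execute each requested command and feed the results back.
    tool_results = []
    for block in response.content:
        if block.type == "tool_use" and block.name == "bash":
            proc = subprocess.run(block.input["command"], shell=True,
                                  capture_output=True, text=True)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": (proc.stdout + proc.stderr)[-4000:],  # crude truncation
            })
    messages.append({"role": "user", "content": tool_results})
```

Almost everything that matters lives at the edges of that loop (what the tool exposes, how output gets truncated, when to stop), which is exactly the part that is easy to improve iteratively.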
I think it's likely that without a long (e.g. multi-decade) AI pause, one or more of these "non-takeover AI risks" can't be solved or reduced to an acceptable level.
I think it is also worth considering the possibility that these risks aren't the sort of thing which can be reduced to an acceptable level with a decade-scale AI pause either. Particularly the ones which people have been trying to solve for centuries already (e.g. principal-agent problem).
also "crowdsource the search for bugs in the implicit physics engine of the video generator"