The pursuit of artificial reasoning took a major leap forward with OpenAI's o1-preview model in September 2024, the first AI system explicitly designed to 'think' before responding, using chain-of-thought reasoning, a technique first introduced by Wei et al. in 2022. DeepSeek-R1, released in January 2025, showed that reasoning capabilities in Large Language Models (LLMs) could be improved through reinforcement learning alone, and Anthropic's Claude 3.7 Sonnet followed in February 2025 with the first hybrid reasoning architecture. These releases laid the foundations for what is now the minimum expected of new language models and established what are known as Large Reasoning Models (LRMs): models designed to show their step-by-step thinking process.
Throughout 2025, more models shipped with this new "thinking" feature: AI giants Google and OpenAI presented Gemini 2.5 Pro and o3 respectively, while Alibaba's QwQ-32B followed DeepSeek's open-source approach. But this rapid progress hit a major roadblock when Apple, which had been harshly criticized for its weak efforts in the AI race, questioned the very foundation of these advances. The central claim of its June 2025 paper "The Illusion of Thinking" is that "despite their sophisticated self-reflection mechanisms learned through reinforcement learning, these models fail to develop generalizable problem-solving capabilities for planning tasks, with performance collapsing to zero beyond a certain complexity." This sparked a debate in the community and surfaced a range of viewpoints on Apple's claim, even prompting Anthropic to publish a critical response, "The Illusion of the Illusion of Thinking," only a few days later.
Having given this brief overview of one of the biggest dilemmas behind the AGI promise (are machines actually capable of thinking on their own?), I thought this was a good chance to share my own thoughts on the matter, as well as to review the arguments presented by Apple and the rebuttals they provoked.
The Illusion.
To gain a deeper understanding of these models' reasoning capabilities, Apple designed a series of, in its own words, innovative experiments intended to replace traditional math benchmarks, which are often contaminated by training data, with a set of puzzles that provide a "controlled environment" for the tests. These puzzles, such as the Tower of Hanoi or River Crossing, let the research team set the desired difficulty precisely and observe not only whether the models reached the solution, but also the "thinking process" developed along the way.
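To make this concrete, here is a minimal sketch of what such a controlled environment can look like for the Tower of Hanoi. This is my own illustration in Python, not code from the paper: the instance is fully specified by a single difficulty knob (the number of disks), and a candidate move sequence can be verified mechanically, with no reference answer and no risk of contamination in the grader itself.

```python
# Minimal sketch of a controllable Tower of Hanoi environment.
# Difficulty is set by a single knob: the number of disks n.
# A candidate solution is a list of (source_peg, target_peg) moves,
# which can be verified mechanically without any reference answer.

def new_instance(n):
    """Three pegs; peg 0 starts with disks n (bottom) .. 1 (top)."""
    return [list(range(n, 0, -1)), [], []]

def apply_move(pegs, src, dst):
    """Apply one move if legal, otherwise report failure."""
    if not pegs[src]:
        return False  # nothing to move from the source peg
    if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
        return False  # cannot place a larger disk on a smaller one
    pegs[dst].append(pegs[src].pop())
    return True

def verify(n, moves):
    """Check that a move sequence solves the n-disk instance."""
    pegs = new_instance(n)
    for src, dst in moves:
        if not apply_move(pegs, src, dst):
            return False
    return len(pegs[2]) == n  # solved when all disks sit on the last peg
```

Because verification is purely mechanical, the difficulty can be dialed up one disk at a time while the scoring procedure stays fixed, which is the property Apple's setup relies on.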
[Figure: Observed complexity issues, retrieved from the original paper]
The research revealed a complete performance collapse across all reasoning models once a certain complexity was reached. It also identified three distinct "performance regimes": on simple tasks, non-thinking LLMs outperformed their thinking counterparts; on medium-complexity tasks, thinking models began to show an advantage; and on high-complexity tasks, both failed completely. The paper also observed an "overthinking phenomenon," in which LRMs often identified the correct solution early yet kept exploring incorrect alternatives, driving up compute costs; counterintuitively, as complexity approached its highest levels, models tended to "stop thinking" even though they still had the token budget to keep reasoning.
The Illusion of the Illusion.
Anthropic's response paper challenged Apple's conclusions on three fundamental grounds. First, they argued that the observed "collapse" primarily reflected output-token limits (the maximum amount of text a model can generate in a single response) rather than reasoning failures. Models actively recognized when they approached these limits, with evidence showing responses like "The pattern continues, but to avoid making this too long, I'll stop here." This suggests the models understood the solution pattern but chose to truncate their output for practical reasons, not because of reasoning limitations.
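Anthropic's point is easy to illustrate with back-of-the-envelope arithmetic. The constants below are my own assumptions, not figures from either paper (roughly 10 output tokens per move in the required listing format, and a 64k-token output cap of the kind typical for current models); the exponential growth of the move list is what matters.

```python
# Rough arithmetic: an exhaustive Tower of Hanoi move list has 2**n - 1 moves.
# TOKENS_PER_MOVE and OUTPUT_CAP are assumptions for illustration only.
TOKENS_PER_MOVE = 10      # e.g. "move disk 3 from peg A to peg C"
OUTPUT_CAP = 64_000       # a typical output-token limit, not a quoted figure

for n in range(8, 16):
    moves = 2 ** n - 1
    tokens = moves * TOKENS_PER_MOVE
    verdict = "fits" if tokens <= OUTPUT_CAP else "exceeds the cap"
    print(f"n={n:2d}: {moves:6d} moves, ~{tokens:7d} tokens -> {verdict}")
```

Under these assumptions the full listing stops fitting somewhere in the low teens of disks; the exact crossover depends entirely on the assumed constants, but no fixed output budget survives exponential growth for long.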
Second, Anthropic identified a critical flaw in Apple's River Crossing experiments: instances with six or more actor/agent pairs and a boat capacity of 3 are mathematically impossible to solve. By automatically scoring these impossible instances as failures, Apple inadvertently penalized models for correctly recognizing unsolvable problems.
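That unsolvability claim is the kind of thing one can check by brute force. The sketch below is mine, and it assumes the classic "jealous couples" style rule (an actor may not share a bank or the boat with another pair's agent unless its own agent is present); Apple's exact rule set may differ in detail. A breadth-first search over the full state space either finds a valid crossing plan or exhausts every reachable state, proving that none exists.

```python
# Brute-force solvability check for an actor/agent River Crossing instance.
# Assumed safety rule: an actor may not be with another pair's agent
# (on either bank or in the boat) unless its own agent is also present.
from collections import deque
from itertools import combinations

def safe(actors, agents):
    # Safe if no agents are present, or every actor's own agent is present.
    return not agents or set(actors) <= set(agents)

def solvable(n, boat_capacity):
    everyone = frozenset(range(n))
    start = (everyone, everyone, True)   # (actors on left, agents on left, boat on left)
    goal = (frozenset(), frozenset(), False)
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        actors_l, agents_l, boat_left = state
        side_actors = actors_l if boat_left else everyone - actors_l
        side_agents = agents_l if boat_left else everyone - agents_l
        people = [("actor", i) for i in side_actors] + [("agent", i) for i in side_agents]
        for k in range(1, boat_capacity + 1):
            for group in combinations(people, k):
                g_actors = {i for role, i in group if role == "actor"}
                g_agents = {i for role, i in group if role == "agent"}
                if not safe(g_actors, g_agents):      # the boat must be safe too
                    continue
                if boat_left:
                    new_al, new_gl = actors_l - g_actors, agents_l - g_agents
                else:
                    new_al, new_gl = actors_l | g_actors, agents_l | g_agents
                if not (safe(new_al, new_gl) and safe(everyone - new_al, everyone - new_gl)):
                    continue
                nxt = (frozenset(new_al), frozenset(new_gl), not boat_left)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} pairs, boat capacity 3: {'solvable' if solvable(n, 3) else 'unsolvable'}")
```

On this formulation the six-pair, capacity-3 instance should come back unsolvable, which matches the flaw Anthropic points out: such instances were nonetheless scored as model failures.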
[Figure: Solution listing format, retrieved from the original paper, with annotations]
Third, they demonstrated that alternative evaluation approaches restored high performance on previously "failed" problems. When asked to generate an algorithmic function that finds the solution directly, instead of an exhaustive list of moves, models achieved high accuracy on Tower of Hanoi instances that Apple had reported as complete failures, using roughly 12 times less output capacity than a full enumeration would require in the original setup.
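The contrast Anthropic draws is essentially the one below: the function that generates a Tower of Hanoi solution is a handful of lines, while the move list it produces has 2^n - 1 entries. Asking for the former tests whether the model knows the procedure; asking for the latter largely tests how much it can transcribe. (The sketch is mine; the rebuttal asked models for code in this spirit rather than for this exact function.)

```python
# A complete Tower of Hanoi solver: the program is a handful of lines,
# even though the move sequence it yields has 2**n - 1 entries.
def hanoi(n, src="A", aux="B", dst="C"):
    """Yield the moves that transfer n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)   # park n-1 disks on the spare peg
    yield (src, dst)                         # move the largest disk
    yield from hanoi(n - 1, aux, src, dst)   # stack the n-1 disks on top of it

moves = list(hanoi(10))
print(len(moves))   # 1023 moves, i.e. 2**10 - 1
```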
Is thinking really an illusion?
Far from settling it, these results reinforced the central question of both studies: are models actually capable of reasoning, or are they just getting better at pattern recognition? The controversy also revealed a more fundamental issue about how we evaluate AI reasoning capabilities. The dispute between Apple and Anthropic highlighted how hard it is to distinguish genuine reasoning failures from experimental artifacts, a critical consideration for future AI research and development.
My aim with this, as I decided to call it, short essay was to walk through the recent events and major ideas surrounding the question of machine thought. As for my own take: I think we are still some years away from a truly thinking system, even though the latest progress is a huge step in that pursuit. I also believe a crucial next step is developing better ways of measuring a model's "reasoning." Apple set a clear direction here, though it may have been somewhat biased against its competition. So, while we are certainly paving the way toward what could become AGI, my answer right now is no: I don't think these models can truly reason… yet.