The Illusion of Thinking
Apple's engineers demonstrate that rapid pattern recognition is not actual thinking
ABSTRACT:
Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces’ structure and quality. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs “think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counter-intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths, limitations, and ultimately raising crucial questions about their true reasoning capabilities.
Here is DeepSeek’s translation of the abstract into plain terms for an educated non-technical reader.
Recent advanced AI models, called Large Reasoning Models (LRMs), show improved problem-solving by displaying their "thought process" before answering. However, their true reasoning abilities, limits, and how they scale with complexity aren’t fully understood. Current testing mostly checks final answers on math and coding problems, which can be biased and doesn’t reveal how the models reason.
To study this, researchers used controlled puzzle experiments, adjusting difficulty while keeping logic consistent. They found that LRMs fail completely beyond a certain complexity level and show a strange pattern: they try harder as problems get tougher—but only up to a point, after which effort drops even when they could keep going.
Comparing LRMs to standard AI models, three key patterns emerged:
Simple tasks: Standard models sometimes outperform LRMs.
Moderate tasks: LRMs benefit from showing their reasoning.
Complex tasks: Both models fail entirely.
Additionally, LRMs struggle with precise calculations, don’t follow clear algorithms, and reason inconsistently. By analyzing their "thoughts," the study highlights their strengths, weaknesses, and raises questions about whether they truly "reason" like humans.
Key Takeaway: While LRMs show promise, they have clear limits in handling complexity and true logical reasoning.
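To put the "complexity dial" in concrete terms: in the Tower of Hanoi (one of the puzzles the paper's conclusion below discusses), the exact solution for N disks takes 2^N - 1 moves, so each added disk roughly doubles the move sequence a model must get entirely right. A minimal sketch of that scaling, my own illustration rather than anything from the paper:

    # Minimum move count for Tower of Hanoi grows as 2**n - 1:
    # each added disk roughly doubles the exact sequence a solver must emit.
    for n in range(1, 16):
        print(f"{n:2d} disks -> {2**n - 1:6d} moves")

By 15 disks the exact sequence is already 32,767 moves long, which gives a sense of how quickly these puzzles become a test of long, exact execution rather than a single clever insight.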
CONCLUSION:
In this paper, we systematically examine frontier Large Reasoning Models (LRMs) through the lens of problem complexity using controllable puzzle environments. Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds. We identified three distinct reasoning regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at moderate complexity, and both collapse at high complexity. Particularly concerning is the counterintuitive reduction in reasoning effort as problems approach critical complexity, suggesting an inherent compute scaling limit in LRMs. Our detailed analysis of reasoning traces further exposed complexity-dependent reasoning patterns, from inefficient “overthinking” on simpler problems to complete failure on complex ones. These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning.
Finally, we presented some surprising results on LRMs that lead to several open questions for future work. Most notably, we observed their limitations in performing exact computation; for example, when we provided the solution algorithm for the Tower of Hanoi to the models, their performance on this puzzle did not improve. Moreover, investigating the first failure move of the models revealed surprising behaviors. For instance, they could perform up to 100 correct moves in the Tower of Hanoi but fail to provide more than 5 correct moves in the River Crossing puzzle.
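For reference, the solution algorithm for the Tower of Hanoi that the authors mention handing to the models is a short textbook recursion that an ordinary program executes flawlessly at any depth. A minimal sketch, not the paper's exact prompt:

    def hanoi(n, src="A", aux="B", dst="C", moves=None):
        """Generate the exact move sequence for n disks (2**n - 1 moves)."""
        if moves is None:
            moves = []
        if n == 0:
            return moves
        hanoi(n - 1, src, dst, aux, moves)   # park the top n-1 disks on the spare peg
        moves.append((src, dst))             # move the largest remaining disk
        hanoi(n - 1, aux, src, dst, moves)   # restack the n-1 disks on top of it
        return moves

    print(len(hanoi(7)))   # 127 moves, correct every time

Running this recursion for just 7 disks produces 127 moves without error, which puts the models' ceiling of roughly 100 correct Tower of Hanoi moves, and only about 5 in the River Crossing puzzle, in perspective.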
Human translation: there is no ghost in the machine. No thinking is taking place, and even modest increases in complexity tend to make the fastest machine pattern-recognition processes unreliable.



There is a limit to how smart computers can be without interacting with the physical world. AI suffers from a lack of "degrees of freedom" in the engineering sense: humans get thousands of inputs from their senses and 30+ outputs, while an AI really has only one input and one output at a time.
That there is no ghost in the machine will be an important thing to remember going forward as AI output becomes increasingly indistinguishable from human output. It's crazy to think the next generation won't know a time before AI. I'm a bit unnerved to think we might be living among AI-powered drones and androids within my lifetime.