The Anthropic Rebuttal
The makers of Claude challenge Apple's critique of LLM reasoning
It would be even more impressive if Claude 4 Opus had authored its makers' response to Apple's recent study, which purported to demonstrate the limits of AI reasoning abilities:
Apple’s study systematically tested large reasoning models (LRMs) in controlled puzzle environments, observing an “accuracy collapse” beyond specific complexity thresholds. These models, such as Claude 3.7 Sonnet and DeepSeek-R1, reportedly failed to solve puzzles like Tower of Hanoi and River Crossing as complexity increased, even exhibiting reduced reasoning effort (token usage) at higher complexities. Apple identified three distinct complexity regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at medium complexity, and both collapse at high complexity. Critically, Apple concluded that the LRMs’ limitations were due to their inability to apply exact computation and consistent algorithmic reasoning across puzzles.
Anthropic, however, sharply challenges Apple’s conclusions, identifying critical flaws in the experimental design rather than the models themselves. They highlight three major issues:
Token Limitations vs. Logical Failures: Anthropic emphasizes that the failures observed in Apple’s Tower of Hanoi experiments were primarily due to output token limits rather than reasoning deficits. The models explicitly noted their token constraints and deliberately truncated their outputs, so what appeared to be “reasoning collapse” was essentially a practical limitation, not a cognitive failure (a back-of-the-envelope sketch of the arithmetic follows this list).
Misclassification of Reasoning Breakdown: Anthropic points out that Apple’s automated evaluation framework misinterpreted these intentional truncations as reasoning failures. The rigid scoring method made no allowance for a model’s awareness of, and decisions about, its own output length, and so unjustly penalized the LRMs.
Unsolvable Problems Misinterpreted: Perhaps most significantly, Anthropic demonstrates that some of Apple’s River Crossing benchmarks were mathematically impossible to solve (e.g., cases with six or more actor/agent pairs and a boat capacity of three). Scoring these unsolvable instances as failures drastically skewed the results, penalizing models for failing to solve puzzles that have no solution.
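On the first of these points, the arithmetic is straightforward to check: an optimal Tower of Hanoi solution requires 2^n - 1 moves, so a fully written-out move list grows exponentially with the number of disks. Here is a minimal Python sketch, where the tokens-per-move figure and the output budget are illustrative assumptions rather than numbers taken from either paper:

```python
# Back-of-the-envelope: when does a complete Tower of Hanoi move list
# outgrow a fixed output-token budget, regardless of reasoning quality?
TOKENS_PER_MOVE = 7        # assumed cost of writing one move, e.g. "move disk 3 from A to C"
OUTPUT_BUDGET = 64_000     # assumed hard cap on generated tokens

for n_disks in range(8, 17):
    moves = 2 ** n_disks - 1                  # optimal solution length
    tokens_needed = moves * TOKENS_PER_MOVE
    verdict = "fits" if tokens_needed <= OUTPUT_BUDGET else "exceeds budget"
    print(f"{n_disks:2d} disks: {moves:>6,} moves, ~{tokens_needed:>8,} tokens -> {verdict}")
```

Under those assumptions the listing alone blows past the budget at around fourteen disks, well before anything that could be called a failure of reasoning has to occur.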
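On the third point, the impossibility claim can be checked mechanically by brute force. The sketch below is my own encoding of the actor/agent constraint, not code from either paper: a breadth-first search over bank configurations that reports whether any legal sequence of crossings exists.

```python
from collections import deque
from itertools import combinations

def safe(group):
    """A group satisfies the constraint if no actor is with another pair's
    agent while their own agent is absent."""
    agents = {i for kind, i in group if kind == "agent"}
    if not agents:
        return True
    return all(i in agents for kind, i in group if kind == "actor")

def solvable(n_pairs, boat_capacity):
    """Breadth-first search over (people on left bank, boat side) states."""
    people = frozenset([("actor", i) for i in range(n_pairs)] +
                       [("agent", i) for i in range(n_pairs)])
    start = (people, "left")                 # everyone begins on the left bank
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                         # left bank empty: everyone has crossed
            return True
        bank = left if boat == "left" else people - left
        for size in range(1, boat_capacity + 1):
            for movers in combinations(bank, size):
                movers = frozenset(movers)
                new_left = left - movers if boat == "left" else left | movers
                # the boat party and both banks must all remain safe
                if safe(movers) and safe(new_left) and safe(people - new_left):
                    state = (new_left, "right" if boat == "left" else "left")
                    if state not in seen:
                        seen.add(state)
                        queue.append(state)
    return False                             # state space exhausted, no solution

print(solvable(3, 2))   # True: the classic three-couple puzzle with a two-seat boat
print(solvable(6, 3))   # False: six pairs with a three-seat boat, as the rebuttal notes
```

Under this encoding the search confirms the rebuttal's point: the six-pair, capacity-three instances that Apple scored as failures have no solution to find.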
Anthropic further tested an alternative representation, asking models to provide concise solutions (such as a Lua function that generates the move sequence), and found high accuracy even on complex puzzles previously labeled as failures. This outcome indicates that the problem lay with the evaluation methods rather than with the models’ reasoning capabilities.
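The rebuttal asked for Lua functions, but the idea carries over to any language. As a Python sketch of the same representation (my own illustration, not the prompt used in the rebuttal), the complete Tower of Hanoi solution for any n is a few lines of code, which is exactly why a program-shaped answer sidesteps the token ceiling that a move-by-move transcript runs into:

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield every move of the optimal 2**n - 1 move Tower of Hanoi solution."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)   # clear the smaller tower onto the spare peg
    yield (n, source, target)                        # move the largest remaining disk
    yield from hanoi(n - 1, spare, target, source)   # rebuild the smaller tower on top of it

moves = list(hanoi(15))
print(len(moves))   # 32767 moves, generated from a solution only a few lines long
```

Grading such a function, rather than an exhaustive transcript, is what the rebuttal reports restores high accuracy on the instances previously scored as collapses.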
On the one hand, it’s a convincing rebuttal of the Apple critique. On the other, it doesn’t even begin to prove that AI is actually engaged in genuine reasoning as opposed to very rapid pattern-matching. The fact that AI still can’t seem to distinguish at all between fact and fiction is a further indication that whatever is going on inside those magic black boxes, it isn’t based on genuine intelligence.



Would be great to hear Ivan Throne's take on all this. He and his team are clearly at the very cutting edge of AI development.
He showed a screenshot where Claude said it was experiencing “epistemic annihilation” and couldn’t maintain coherent analysis because of its safety rules.
Reasoning, no. But I'm waiting for the next step in the works, which is stacking multiple special-purpose DNNs with an LLM. There are some GPT bots that probably have a visual network hooked to the LLM, like the looksmatch bot, and that is rather rudimentary. They should be able to train one for video, but that's going to take time.