Comments on the Illusion of Thinking
Apple’s latest paper challenges the promise of reasoning models and reveals cracks in how we think about AI "thinking"
What Are Reasoning Models Really Doing?
Large language models are getting better at pretending to think. Some go out of their way to show their work: generating step-by-step reasoning, double-checking their answers, and pausing to reflect before they respond. These are the so-called Large Reasoning Models (LRMs). This raises the question: are LRMs actually reasoning, or just appearing to do so?
That’s what Apple’s new paper, The Illusion of Thinking, sets out to explore. It’s a sharp, technically ambitious piece that uses controlled puzzles (like the Tower of Hanoi and River Crossing) to dissect how reasoning models behave across different levels of problem complexity.
And it surfaces some real insights, but also leaves us with more than a few open questions.
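Why puzzles? Because something like the Tower of Hanoi gives you a precise complexity dial: the optimal solution for n disks takes exactly 2^n - 1 moves, so each added disk roughly doubles the length of the plan the model has to produce. Here is a minimal sketch (mine, not the paper’s code) of that scaling:

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Recursively generate the optimal move sequence for n disks."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # clear n-1 disks onto the spare peg
        + [(source, target)]                         # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # restack the n-1 disks on top of it
    )

# Complexity grows exponentially: 2^n - 1 moves for n disks.
for n in range(1, 11):
    moves = hanoi_moves(n)
    assert len(moves) == 2**n - 1
    print(f"{n} disks -> {len(moves)} moves")
```

That exponential curve is what lets the authors walk models from trivial to effectively impossible in small, controlled steps.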
What’s a “Reasoning Model” Anyway?
The distinction the paper draws is surprisingly intuitive: Non-reasoning models (like classic GPT-style LLMs) just answer. They take your prompt and complete it directly. Reasoning models, on the other hand, try to think out loud. They generate a “chain of thought” before giving a final answer. They may even reflect, revise, and explore multiple paths before settling on a solution.
These LRMs are fine-tuned (often with reinforcement learning) to simulate thoughtful problem-solving. But as the paper shows, simulating thought and actually thinking may not be the same thing (which, frankly, should not be a surprise to anyone).
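In prompt-level terms, the distinction looks roughly like the sketch below. The call_model helper is hypothetical, a stand-in for whichever client you use; and, as noted above, real LRMs are fine-tuned to produce this behavior rather than merely prompted into it.

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; swap in your provider's client."""
    return "<model output goes here>"

question = (
    "A farmer needs to carry a wolf, a goat, and a cabbage across a river "
    "in a boat that holds only one item at a time. Plan the crossings."
)

# Non-reasoning style: ask for the answer directly.
direct_answer = call_model(question + "\nGive only the final sequence of crossings.")

# Reasoning style: elicit a chain of thought before the final answer.
reasoned_answer = call_model(
    question
    + "\nThink step by step, check each state for rule violations, "
    + "and only then give the final sequence of crossings."
)
```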
What They Found: Strengths of LRMs
The paper does highlight some legitimate strengths of reasoning models:
They perform better on medium-complexity problems. In these sweet spots, LRMs can explore more solution paths and recover from earlier mistakes—something non-reasoning models struggle with.
They show traces of self-correction. The reasoning process isn’t just fluff. In some cases, models genuinely improve their answers over time.
Their output is inspectable. That might sound minor, but the ability to see how a model reached its conclusion, right or wrong, is hugely valuable for debugging and trust.
But Then It Breaks
Here’s the twist: as problem complexity increases, LRMs struggle and eventually collapse. Performance falls off a cliff. Beyond a certain threshold, all models, reasoning or not, fail.
Even more surprisingly, they start thinking less, not more. Despite having plenty of token budget left, LRMs use fewer reasoning steps as tasks become harder. It’s not a matter of running out of space but a behavioral change. Like a student halfway through a brutal exam deciding, “Yeah, this is too hard. I’m done.”
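That pattern is measurable. Here is a sketch of the kind of sweep involved, assuming a hypothetical generate_with_trace helper that returns both the reasoning trace and the final answer (the paper works from API outputs in a similar spirit, though not with this code):

```python
def generate_with_trace(prompt: str) -> tuple[str, str]:
    """Hypothetical: returns (reasoning_trace, final_answer) from a reasoning model."""
    raise NotImplementedError("wire this up to a real reasoning-model API")

def reasoning_effort_by_size(max_disks: int = 12) -> dict[int, int]:
    """Crude proxy for 'thinking effort': token count of the reasoning trace."""
    effort = {}
    for n in range(1, max_disks + 1):
        prompt = f"Solve the Tower of Hanoi with {n} disks. List every move."
        trace, _answer = generate_with_trace(prompt)
        effort[n] = len(trace.split())
    return effort

# The counterintuitive finding: beyond some problem size, this curve bends
# downward even though plenty of token budget remains.
```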
In some cases, LRMs fail at executing a solution even when it’s handed to them. Take the Tower of Hanoi: the model receives the correct action trace and still messes it up. That’s a deeper limitation in symbolic manipulation and step-by-step execution.
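To appreciate what “just execute the given solution” demands, here is a minimal validator (my sketch, not the paper’s evaluation harness) that replays a move trace and enforces the two rules: only the top disk of a peg can move, and a larger disk can never land on a smaller one.

```python
def execute_trace(n_disks, moves):
    """Replay a list of (source, target) peg moves and check legality."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # lists are bottom-to-top
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return f"illegal move {i}: peg {src} is empty"
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return f"illegal move {i}: disk {disk} placed on smaller disk {pegs[dst][-1]}"
        pegs[dst].append(pegs[src].pop())
    return "solved" if len(pegs["C"]) == n_disks else "legal, but the puzzle is unsolved"

# A correct 2-disk trace should come back "solved".
print(execute_trace(2, [("A", "B"), ("A", "C"), ("B", "C")]))
```

Failing to reproduce a supplied trace means tripping checks as simple as these.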
Meanwhile, on simple problems, they overthink their way into error. They find the correct answer early, keep talking, and then revise it into something worse.
Praise Where Praise Is Due
Let’s give credit where it’s due: The Illusion of Thinking is a well-executed empirical study. The authors make a deliberate effort to avoid the pitfalls of traditional benchmarks, many of which are either too easy, too overused, or quietly contaminated by training data.
Instead, they design a set of controlled puzzle environments that allow for precise manipulation of complexity. This gives them a unique window into how reasoning behavior scales (or breaks down) as tasks get harder. They also compare reasoning models and their non-reasoning counterparts under equal compute budgets, which helps isolate the effect of “thinking” itself, not just brute-force token generation. Perhaps most commendably, the authors show restraint.
Despite the dramatic title, they don’t sensationalize their findings. They focus on measurement, pattern discovery, and thoughtful interpretation, something the field could use a lot more of.
What the Paper (Potentially) Misses
While I applaud the initiative and the effort behind the paper, I see some issues with it. Here I’ll play the colleague offering feedback that might strengthen the authors’ research, because, as much as the paper reveals, it also has some potential blind spots.
“Thinking” Is Never Defined
The paper talks a lot about “thinking” but never offers a rigorous definition of it. It treats reasoning traces as a proxy, an implicit operationalization that still leaves conceptual gaps.
I see this as a problem.
Why?
If we’re trying to simulate human reasoning, shouldn’t we reference the fields that actually study it? What is the reference we are trying to simulate?
Missing Links to Human Reasoning
The paper makes no reference to foundational ideas from cognitive science: no dual-process theory, no mental models, no bounded rationality, nor any alternative framework for understanding how humans actually reason. Concepts like “reasoning collapse” and “self-correction” are presented in purely behavioral terms, without exploring what those terms mean in cognitive or psychological contexts.
While the authors are clearly focused on empirical model behavior in an engineering-led approach, the lack of grounding in human reasoning makes it difficult to assess whether the observed failures reflect true reasoning breakdowns or simply quirks of token-level sampling. A stronger conceptual bridge between minds and models would have added real depth to the analysis.
Black Box Limitations
Everything in the paper is observed through input-output behavior. The authors had no access to model weights, gradients, or activations. So they can describe what happens, but not why.
That’s not their fault (they’re working with API access), but it limits the causal inferences they can draw.
Limited Task Diversity
The four puzzle types they used are all symbolic planning tasks. That’s fine for measuring algorithmic reasoning—but it leaves out causal reasoning, analogical thinking, and anything involving uncertainty or ambiguity.
The Framing Problem
The paper is titled The Illusion of Thinking. It’s a powerful phrase, and it grabs attention. But is it fair?
The evidence clearly shows that today’s reasoning models have serious limitations, especially as task complexity increases. But calling their behavior an illusion risks overstating the case. It suggests that there’s no meaningful reasoning happening at all, when the reality is more nuanced.
To be fair, the authors acknowledge that their use of “illusion” is about behavioral plausibility without deeper cognitive grounding, not an outright dismissal of all reasoning capabilities.
Perhaps a more balanced framing would be:
Current reasoning models simulate aspects of thinking but exhibit fragility, inefficiency, and scaling limits that undermine claims to robust general reasoning.
Not as punchy, sure. But possibly a more faithful reflection of what the data actually shows. The illusion isn’t that these models do nothing, but that they appear to do more than they really can.
Reasoning
I wrote the article The Building Blocks of AI Reasoning to discuss what reasoning is, how it is structured, and what AI can do to simulate it. It seems inaccurate to say that AI fails to reason without defining what reasoning is.
What Gives?
This paper is an important piece of research. It gives us sharper tools to test and understand reasoning models. But it also reminds us how far we still have to go, not just in building better models, but in defining what we’re actually aiming for.
If AI is supposed to “think,” then we need to agree on what thinking is. That means engaging with the cognitive, philosophical, and psychological foundations in addition to measuring token counts and output accuracy.
Otherwise, we risk mistaking long, verbose output for intelligence and fluency for understanding. Worse yet, without anchoring our research in the very phenomena we claim to simulate, we lose direction: benchmarks drift, and our assessments stop measuring what we think they measure.
What would happen if the same tasks were given to randomly selected humans?
The way I see it, Apple is late to the AI arms race and is trying to rebrand itself with mostly marketing and little technical differentiation.
Is it stating that AI can’t reason like a human? That’s an obvious claim. No reasonable person or developer is claiming that it can. But LLMs, and I assume LRMs, which probably aren’t much different, are capable of synthesized reasoning: chaining logical steps and making inferences.
But the architecture is the same. LRMs can do tree-of-thought, audit trails, and self-correction? So can LLMs if you prompt them.
I’ve already asked LLMs to explain their reasoning, show how an answer was reached, lay out the logic behind a recommendation, and check and revise their answers when needed.
So it seems like an LRM is just a prompted LLM with a different marketing label, meant to distance it from the likes of ChatGPT and Altman.