What Apple’s controversial research paper really tells us about LLMs

Generative AI models quickly proved they were capable of performing technical tasks well. Adding reasoning to these models unlocked unforeseen capabilities, enabling them to think through more complex questions and produce better-quality, more accurate responses -- or so we thought.
Last week, Apple released a research report called "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity." As the title reveals, the 30-page paper dives into whether large reasoning models (LRMs), such as OpenAI's o1 models, Anthropic's Claude 3.7 Sonnet Thinking (which is the reasoning version of the base model, Claude 3.7 Sonnet), and DeepSeek R1, are capable of delivering the advanced "thinking" they advertise.
(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
Also: OpenAI's o1 lies more than any major AI model. Why that matters
Apple carried out the investigation by creating a series of experiments in the form of diverse puzzles that tested models beyond the scope of traditional math and coding benchmarks. The results showed that even the smartest models hit a point of diminishing returns: they increase their reasoning effort as a problem grows more complex, but only up to a limit.
I encourage you to read it if you are remotely interested in the subject. However, if you don't have the time and just want the bigger themes, I unpack it for you below.
What are large reasoning models (LRMs)?
In the research paper, Apple uses "large reasoning models" when referring to what we would typically just call reasoning models. This type of large language model (LLM) was first popularized by the release of OpenAI's o1 model, which was later followed by its release of o3.
The concept behind LRMs is simple: just as humans are encouraged to think before they speak to produce a more considered comment, a model encouraged to spend more time working through a prompt should produce a higher-quality answer, and that process should enable it to respond well to more complex prompts.
Also: Apple's 'The Illusion of Thinking' is shocking - but here's what it missed
Methods such as "Chain-of-Thought" (CoT) also enable this extra thinking. CoT encourages an LLM to break down a complex problem into logical, smaller, and solvable steps. The model sometimes shares these reasoning steps with users, making the model more interpretable and allowing users to better steer its responses and identify errors in reasoning. The raw CoT is often kept private to prevent bad actors from seeing weaknesses, which could tell them exactly how to jailbreak a model.
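To make the idea concrete, here is a purely illustrative Python sketch (my own, not from the paper or any specific product) contrasting a direct prompt with a CoT-style prompt; the exact wording of the reasoning instruction is an assumption, since real systems use model-specific reasoning modes.

```python
# Illustrative only: contrasting a direct prompt with a Chain-of-Thought-style prompt.
# The puzzle text and CoT instruction are hypothetical examples, not the paper's prompts.

PUZZLE = "Tower of Hanoi with 4 disks: list every move needed."

direct_prompt = f"{PUZZLE}\nGive only the final move list."

cot_prompt = (
    f"{PUZZLE}\n"
    "Think step by step: break the problem into smaller subproblems, "
    "solve each one in order, and show your intermediate reasoning "
    "before giving the final move list."
)

print(direct_prompt)
print("---")
print(cot_prompt)
```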
This extra processing means these models require more compute power and are therefore more expensive or token-heavy, and take longer to return an answer. For that reason, they are not meant for broad, everyday tasks, but rather reserved for more complex or STEM-related tasks.
This also means that the benchmarks used to test these LRMs typically involve math or coding, which is one of Apple's first qualms in the paper. The company said these benchmarks emphasize the final answer, pay less attention to the reasoning process, and are often subject to data contamination. As a result, Apple set up a new experimental paradigm.
The experiments
Apple set up four controllable puzzles: Tower of Hanoi, which involves transferring disks across pegs; Checkers Jumping, which involves repositioning and swapping colored checkers pieces; River Crossing, which involves getting pieces across a river under constraints; and Blocks World, which involves rearranging blocks into a target configuration.
Understanding why these experiments were chosen is key to understanding the paper's results. Apple chose puzzles to better isolate the factors that drive what existing benchmarks report as better performance. Specifically, the puzzles allow for a more "controlled" environment where, even as the difficulty is scaled up, the underlying reasoning required remains the same.
"These environments allow for precise manipulation of problem complexity while maintaining consistent logical processes, enabling a more rigorous analysis of reasoning patterns and limitations," the authors explained in the paper.
The first experiment used the puzzles to compare the "thinking" and "non-thinking" versions of popular models, including Claude 3.7 Sonnet and DeepSeek's R1 and V3. The authors manipulated the difficulty by increasing the problem size.
The last important element of the setup is that all the models were given the same maximum token budget (64k). Then, 25 samples were generated with each model, and the average performance of each model across them was recorded.
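Putting those pieces together, the evaluation loop would look roughly like the sketch below. This is my reconstruction of the setup as described, not Apple's actual harness; `query_model` and `is_correct_solution` are hypothetical stand-ins for the model API call and the puzzle simulator that verifies a proposed solution.

```python
# Rough reconstruction of the evaluation setup described above (not Apple's code):
# same 64k token budget for every model, 25 samples per puzzle instance,
# accuracy averaged over the samples.

import random
from statistics import mean

MAX_TOKENS = 64_000   # shared maximum token budget reported in the paper
NUM_SAMPLES = 25      # samples generated per model per puzzle instance

def query_model(model_name, prompt, max_tokens):
    """Hypothetical stub: a real harness would call the model's API here."""
    return {"answer": "...", "solved": random.random() < 0.5}  # placeholder response

def is_correct_solution(response):
    """Hypothetical stub: a puzzle simulator would verify the proposed move sequence."""
    return response["solved"]

def average_accuracy(model_name, prompt):
    results = [
        is_correct_solution(query_model(model_name, prompt, MAX_TOKENS))
        for _ in range(NUM_SAMPLES)
    ]
    return mean(results)

print(average_accuracy("claude-3.7-sonnet-thinking", "Solve Tower of Hanoi with 8 disks ..."))
```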
The results
The findings showed that thinking and non-thinking models offer different advantages at different levels of difficulty. In the first regime, when problem complexity is low, non-thinking models can perform as well as, if not better than, thinking models while being more time-efficient.
The biggest advantage of thinking models lies in the second, medium-complexity regime, where the performance gap between thinking and non-thinking models widens significantly. Then, in the third regime, where problem complexity is highest, the performance of both model types collapses to zero.
Also: With AI models clobbering every benchmark, it's time for human evaluation
"Results show that while thinking models delay this collapse, they also ultimately encounter the same fundamental limitations as their non-thinking counterparts," said the authors.
They observed a similar collapse when testing five state-of-the-art thinking models -- o3-mini (medium and high configurations), DeepSeek R1, DeepSeek R1 Qwen 32B, and Claude 3.7 Sonnet Thinking -- on the same puzzles used in the first experiment: as complexity grew, accuracy fell, eventually plateauing at zero.
Figure 6: Accuracy and thinking tokens vs. problem complexity for reasoning models across puzzle environments. As complexity increases, reasoning models initially spend more tokens while accuracy declines gradually, until a critical point where reasoning collapses—performance drops sharply and reasoning effort decreases.
Even more interesting is the change in the number of thinking tokens used. Initially, as the puzzles grow in complexity, the models allocate more tokens, as needed to solve the harder problem. However, as they approach the accuracy drop-off point, they begin reducing their reasoning effort, even though the problem is more difficult and they would be expected to use more.
The paper identifies other shortcomings: for example, even when prompted with the exact steps needed to solve a problem, thinking models were still unable to execute them accurately, despite that task being technically less difficult than devising the solution from scratch.
What does this mean?
Public perception of the paper has been split on what it really means for users. While some have found comfort in the paper's results, saying they show we are further from AGI than tech CEOs would have us believe, many experts have identified methodological issues.
The overarching criticisms include that the higher-complexity problems would require a larger token allowance to solve than the 64k cap Apple allocated to each model. Others noted that some models that might have performed well, such as o3-mini and o4-mini, weren't included in the experiment. One user even fed the paper to o3 and asked it to identify methodology issues; ChatGPT raised a few critiques of its own, such as the token ceiling and statistical soundness.
My interpretation: If you take the paper's results at face value, the authors do not explicitly say that LRMs are not capable of reasoning or that it is not worth using them. Rather, the paper points out that there are some limitations to these models that could still be researched and iterated on in the future -- a conclusion that holds true for most advancements in the AI space.
The paper serves as yet another good reminder that none of these models is infallible, regardless of how advanced they claim to be or how they perform on benchmarks. Evaluating an LLM on benchmarks carries its own array of issues, as benchmarks often test only narrow, high-level tasks that don't accurately translate into everyday applications of these models.