It is very difficult to formally prove that a given system has reached, or is close to reaching, some limit, and it is even more difficult with neural nets given their black-box nature.
The paper published by Apple should not be considered a definitive statement, but rather a hint and a conversation starter.
I think that many important actors in the AI field are making predictions about the future and potential of AI with almost nothing to back their claims.
At the end of the day, I fully expect large-n Hanoi and all these things to end up as yet another benchmark. Like all the needle-in-a-haystack or spelling tests that people used to show shortcomings of LLMs, shortcomings that were actually just technical implementation artefacts and got solved pretty fast by integrating that kind of problem into training. LLMs will always have to use a slightly different approach to reasoning than humans because of these technical aspects, but that doesn't mean that they are fundamentally inferior or something. It only means we can't rely on human training data forever and have to look more towards stuff like RL.
ai-christianson 5 hours ago [-]
All I know is I've been getting real world value out of LLMs since ~GPT3 and they've been producing more value with each release.
sigmoid10 5 hours ago [-]
People also like to forget that from the dawn of modern computing and AI research like 60 years ago all the way to 7 years ago, the best models in the world could barely form a few coherent sentences. If LLMs are this century's transistor, we are barely beyond the point of building-sized computers that are trying to find normal life applications.
ksec 4 hours ago [-]
I am a little lost.
>The first issue I have with the paper is that Tower of Hanoi is a worse test case for reasoning than math and coding. If you’re worried that math and coding benchmarks suffer from contamination, why would you pick well-known puzzles for which we know the solutions exist in the training data?
Isn't that exactly what is wrong? It is in the training data and it can't complete it.
It simply isn't reasoning; it is second-guessing a lot of things as though it is reasoning.
crustycoder 3 hours ago [-]
My favourite example of the underlying probabilistic nature of LLMs is related to a niche hobby of mine, English Change Ringing. Every time someone asks an LLM a question that requires more than a basic definition of what Change Ringing is, the result is hilarious. Not only do the answers suffer from factual hallucinations, they aren't even internally logically consistent. It's literally just probabilistic word soup, and glaringly obviously so.
Although there isn't a vast corpus on Method Ringing, there is a fair amount; the "rules" are online (https://framework.cccbr.org.uk/version2/index.html), Change ringing is based on pure maths (Group Theory) and has been linked with CS from when CS first started - it's mentioned in Knuth, and the Steinhaus–Johnson–Trotter algorithm for generating permutations wasn't invented by them in the 1960's, it was known to Change Ringers in the 1650's. Think of it as Towers of Hanoi with knobs on :-) So it would seem a good fit for automated reasoning, and indeed such things already exist - https://ropley.com/?page_id=25777.
If I asked a non-ringing human to explain to me how to ring Cambridge Major, they'd say "Sorry, I don't know" and an LLM with insufficient training data would probably say the same. The problem is when LLMs know just enough to be dangerous, but they don't know what they don't know. The more abstruse a topic is, the worse LLMs are going to do at it, and it's precisely those areas where people are most likely to turn to them for answers. They'll get one that's grammatically correct and sounds authoritative - but they almost certainly won't know if it's nonsense.
Adding a "reliability" score to LLM output seems eminently feasible, but due to the hype and commercial pressures around the current generation of LLMs, that's never going to happen as the pressure is on to produce plausible sounding output, even if it's bullshit.
I'm seriously fed up with all this fact-free AI hype. Whenever an LLM regurgitates training data, it's heralded as the coming of AGI. Whenever it's shown that they can't solve any novel problem, the research is in bad faith (but please make sure to publish the questions so that the next model version can solve them -- of course completely by chance).
Here's a quote from the article:
> How many humans can sit down and correctly work out a thousand Tower of Hanoi steps? There are definitely many humans who could do this. But there are also many humans who can’t. Do those humans not have the ability to reason? Of course they do! They just don’t have the conscientiousness and patience required to correctly go through a thousand iterations of the algorithm by hand. (Footnote: I would like to sit down all the people who are smugly tweeting about this with a pen and paper and get them to produce every solution step for ten-disk Tower of Hanoi.)
In case someone imagines that fancy recursive reasoning is necessary to solve the Towers of Hanoi, here's the algorithm to move 10 (or any even number of) disks from peg A to peg C:
1. Move one disk from peg A to peg B or vice versa, whichever move is legal.
2. Move one disk from peg A to peg C or vice versa, whichever move is legal.
3. Move one disk from peg B to peg C or vice versa, whichever move is legal.
4. Goto 1.
Second-graders can follow that, if motivated enough.
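For the curious, here is a minimal Python sketch of that loop (my own toy code, not from either paper), just to show that no deep recursion or cleverness is needed:

    import itertools

    def solve_hanoi(n):
        # Iterative Tower of Hanoi for an even number of disks, moving
        # everything from peg A to peg C: cycle through the pairs
        # (A,B), (A,C), (B,C) and make the one legal move each time.
        assert n % 2 == 0, "this peg ordering assumes an even disk count"
        pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
        moves = []

        def legal_move(x, y):
            # Exactly one direction is legal between any two pegs:
            # a smaller disk goes on top of a larger one or onto an empty peg.
            if pegs[x] and (not pegs[y] or pegs[x][-1] < pegs[y][-1]):
                src, dst = x, y
            else:
                src, dst = y, x
            pegs[dst].append(pegs[src].pop())
            moves.append((src, dst))

        pairs = itertools.cycle([("A", "B"), ("A", "C"), ("B", "C")])
        while len(pegs["C"]) < n:
            legal_move(*next(pairs))
        return moves

    print(len(solve_hanoi(10)))  # 1023 moves, i.e. 2**10 - 1

Ten disks is 1023 repetitions of that loop; the hard part is the bookkeeping, not the insight.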
There's now constant, nonstop, obnoxious shouting on every channel about how these AI models have solved the Turing test (one wonders just how stupid these "evaluators" were), are at the level of junior devs (LOL), and actually already have "PhD level" reasoning capabilities.
I don't know who is supposed to be fooled -- we have access to these things, we can try them. One can easily knock out any latest version of GPT-PhD-level-model-of-the-week with a trivial question. Nothing fundamentally changed about that since GPT-2.
The hype and the observable reality are now so far apart that one really has to wonder: Are people this easily gullible? Or do so many people in tech benefit from the hype train that they don't want to rain on the parade?
mannykannot 5 hours ago [-]
Yes, the whole "Towers of Hanoi is a bad test case" objection is a non-sequitur here. It would be a significant objection if the machines performed well, but not given the actual outcome - it is as if an alleged chess grandmaster almost always lost against opponents of unexceptional ability.
It is actually worse than that analogy: Towers of Hanoi is a bimodal puzzle, in which players who grasp the general solution do inordinately better than those who do not, and the machines here are performing like the latter.
Lest anyone think otherwise, this is not a case of setting up the machines to fail, any more than the chess analogy would be. The choice of Towers of Hanoi leaves it conceivable that they would do well on tough problems, but that is not very plausible and needs to be demonstrated before it can be assumed.
vidarh 5 hours ago [-]
They set it up to fail the moment they ran it with a large number of disks and assumed the models would just keep going as if they were running the same simple algorithm in a loop, and the moment they set the temperature to 1.
mannykannot 4 hours ago [-]
I take your point that the absence of any discussion of the effect of temperature choice or justification for choosing 1 seems to be an issue with the paper (unless it is quite obviously the only rational choice to those working in the field?)
munksbeer 2 hours ago [-]
I could be wrong, but it seems you have misunderstood something here, and you've even quoted the part that you've misunderstood. It isn't that the algorithm for solving the problem isn't known. The LLM knows it, just like you do. It is that the steps of following the algorithm are too verbose if you're just writing them down and trying to keep track of the state of the problem in your head. Could you do that for a large number of disks?
Please do correct me if the misunderstanding is mine.
emp17344 22 minutes ago [-]
I feel like practically anybody could solve Tower of Hanoi for any degree of complexity using this algorithm. It’s a four step process that you just repeat over and over.
vidarh 5 hours ago [-]
> Second-graders can follow that, if motivated enough.
Try to motivate them sufficiently to do so without error for a large number of disks, I dare you.
Now repeat this experiment while randomly refusing to accept the answer they're most confident in at any given iteration, picking a less-confident answer on their behalf instead, and insisting they still solve it without error.
(To make it equivalent to the researchers running this with temperature set to 1)
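For anyone unfamiliar with what temperature does, here is a toy sketch (made-up logits, not from the paper):

    import math

    def sample_probs(logits, temperature):
        # Softmax over next-token logits at a given temperature.
        # As T -> 0 this approaches greedy decoding (always the top token);
        # at T = 1 the model samples from its raw distribution, so
        # lower-probability tokens get picked regularly, and over thousands
        # of forced steps some of those picks will be wrong moves.
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    logits = [4.0, 2.0, 1.0]           # toy next-token logits
    print(sample_probs(logits, 1.0))   # ~[0.84, 0.11, 0.04]
    print(sample_probs(logits, 0.2))   # ~[1.00, 0.00, 0.00], near-greedy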
akoboldfrying 6 hours ago [-]
> obnoxious shouting on every channel about how these AI models have solved the Turing test (one wonders just how stupid these "evaluators" were)
Huh? Schoolteachers and university professors complaining about being unable to distinguish ChatGPT-written essay answers from student-written essay answers is literally ChatGPT passing the Turing test in real time.
amelius 5 hours ago [-]
It's a Turing test with human-prefiltered responses at best.
delusional 5 hours ago [-]
No it's not. The traditional interpretation of the Turing test requires interactivity. That is, the evaluator is allowed to ask questions and will receive a response from both a person and a machine. The idea is that there should be no sequence of questions you can ask that would reliably identify the machine. That's not even close to true for these "AI" systems.
absummer 5 hours ago [-]
The original Turing game was about testing for a male or female player.
If you want to know more about that, or this research, you could try asking AI for a no-fluff summary.
The Transformer architecture, the algorithms, and the matrix multiplications are a bit more involved. It would be hard to keep those inside your chain-of-thought / working memory and still understand what is going on here.
delusional 4 hours ago [-]
> If you want to know more about that, or this research, you could try asking AI for a no-fluff summary.
Or I could just read it. With my human eyes. It's like a single page.
akoboldfrying 5 hours ago [-]
You're right about interactivity, something that I overlooked -- but I think it's nevertheless the case that a large fraction of human interrogators could not distinguish a human from a suitably-system-prompted ChatGPT even over the course of an interactive discussion.
ChatGPT 4.5 was judged to be the human 73% of the time in this RCT study, where human interrogators had 5-minute conversations with a human and an LLM: https://arxiv.org/pdf/2503.23674
Joeboy 5 hours ago [-]
This is kind of an irrelevant (and doubtless unoriginal) shower thought here but, if humans are judging the AI to be human much more often than the human, surely that means the AI is not faithfully reproducing human behaviour.
akoboldfrying 4 hours ago [-]
Sure, a non-human's performance "should" be capped at ~50% for a large sample size. I think seeing a much higher percentage, like 73%, indicates systematic error in the interrogator. This -- the fact that humans are not good at detecting genuine human behaviour -- is really a problem in the Turing test itself, but I don't see a good way to solve it.
LLaMa 3.1 with the same prompt "only" managed to be judged human 56% of the time, so perhaps it's actually closer to real human behaviour.
delusional 4 hours ago [-]
This comes down to the interpretation of the Turing test. Turing's original test actually pitted the two "unknowns" against each other. Put simply, both the human and the computer would try to make you believe they were the person. The objective of the game was to be seen as human, not to be indistinguishable from human.
This is obviously not quite what people understand the Turing test as anymore, and I think that interpretation confusion actually ends up weakening the linked paper. Your thought aptly describes a problem with the paper, but that problem is not present in the Turing test by its original formulation.
akoboldfrying 4 hours ago [-]
If you're referring to the paper I linked to, their experiments use bona fide 3-party Turing tests as per Turing's original "Imitation Game".
delusional 1 hours ago [-]
It's hard to say what a "bona fide 3-party Turing test" is. The paper even has a section trying to tackle that issue.
I think trying to discuss the minutiae of the rules is a path that leads only to madness. The Turing test was always meant to be a philosophical game. The point was to establish a scenario in which a computer could be indistinguishable from a human. Carrying it out in reality is meaningless, unless you're willing to abandon all intuitive morality.
Quite frankly, I find the paper you linked misguided. If it was undertaken by some college students, then it's good practice, but if it was carried out by seasoned professionals they should find something better to do.
lostmsu 3 hours ago [-]
Shameless self-plug: You can try a two-way variant at https://trashtalk.borg.games/ (also have to guess relative ELO)
It would be surprising if you didn't quickly learn to win.
absummer 6 hours ago [-]
The painful thing about achieving AGI, is that humans reasoning about AI will seem so dumb.
K0balt 6 hours ago [-]
[flagged]
empiko 5 hours ago [-]
> LLMs are human culture compiled into code. They will strongly tend to follow patterns of human behavior, to the point of that being their main (only?) feature.
Perhaps, but this already disproves the idea of superhuman PhD+ level AI agents.
---
I think that the paper has a good motivation. It would be great if we were able to somehow define the complexity of problems AIs are able to tackle. But the Hanoi towers puzzle does not seem like a good match, especially since you can generate the solution mechanically, if you are aware of the very simple algorithm.
K0balt 5 hours ago [-]
> Perhaps, but this already disproves the idea of superhuman PhD+ level AI agents.
The value in AI is not some higher level of understanding, but rather the ability to (potentially) carry on millions of independent and interleaved thought threads/conversations at once, and to work tirelessly at a high throughput.
Perhaps with some kind of recursive approach with iterative physical grounding that tests hypotheses against ground truth, AI can transcend human levels of understanding, but for now we need to understand that AI is going to be more like intern-level assistance with occasional episodes of drunk uncle Bob.
K0balt 5 hours ago [-]
Your comment sparked a thought that might be of value.
If LLMs are indeed primarily useful for solving simpler tasks and will systematically balk at complex problems, perhaps a variant of the “thinking” approach is of value?
Where a high-iteration task is approached by solving one part of the problem and then identifying the next -small part- of the problem, in an iteration loop.
I can also see this easily going off the rails, but perhaps posing each following iteration as -small- and encapsulating only the relevant context while trimming off the context tail could work as an agentic behavior spawned from the main supervisory task?
Of course, just having the model write and run code would probably be better for many classes of problems.
Maybe something to be looked at, if it’s not already being used.
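Something like this loop, maybe (a hypothetical sketch: `llm` is a stand-in prompt-to-text callable and the prompt wording is made up):

    def iterate_small_steps(task, llm, max_steps=100):
        # Toy sketch of the idea above: ask the model only for the next
        # small step, carry forward a trimmed summary instead of the full
        # transcript, and stop when it reports it is done.
        state = f"Task: {task}\nProgress so far: nothing yet."
        for _ in range(max_steps):
            prompt = (
                "You are solving a larger task one small step at a time.\n"
                f"{state}\n"
                "Do ONLY the next small step, then restate the overall "
                "progress in a few lines. Reply DONE when finished."
            )
            step = llm(prompt)
            if "DONE" in step:
                return state
            # Trim the context tail: keep only the latest summary.
            state = f"Task: {task}\nProgress so far: {step[-2000:]}"
        return state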
vidarh 5 hours ago [-]
A pet peeve of mine is that these papers practically never benchmark against humans before they make grandiose claims of how bad these models are.
If you tried to get a human to write out a solution to Tower of Hanoi with a large number of disks, you'd get a whole lot of refusals, and even if you cajoled people into doing it, you'd get a whole lot of errors because people would get sloppy.
That said, there were some useful bits - basically the notion that we can't expect to just ramp inference budgets higher and higher for reasoning models and keep getting gains without improving the underlying LLMs too.
I just wish they'd been more rigorous, and stuck to that.
KingMob 4 hours ago [-]
To paraphrase GOB Bluth:
"Illusions, Michael! Thinking is something a whore does for money!"
...slow pan to shocked group of staring children...
"..or cocaine!"
talles 5 hours ago [-]
Someone please reply with the title "The illusion of The illusion of The illusion of Thinking".
I found this comment to be relevant: "Keep in mind this whitepaper is really just Apple circling the wagons because they have dick for proprietary AI tech."
When you question the source, it really does raise eyebrows, especially as an Apple shareholder: these Apple employees are not busy working on their own AI programme, which is now insanely far behind other big tech companies, but are instead spending their time casting shade on the reasoning models developed at other AI labs.
What's the motivation here, really? The paper itself isn't particularly insightful or ground-breaking.
tikhonj 5 hours ago [-]
The motivation is doing research to better understand AI?
People's time and attention are not fungible—especially in inherently creative pursuits like research—and the mindset in your comment is exactly the sort of superficial administrative reasoning that leads to hype bubbles unconstrained by reality.
"Why are you wasting your time trying to understand what we're doing instead of rushing ahead without thinking" is absolutely something I've heard from managers and executives, albeit phrased more politically, and it never ends well in a holistic accounting.
emp17344 6 hours ago [-]
The paper was written by highly accomplished ML researchers who don’t have any stake in Apple’s continued success. Framing this peer-reviewed research written by respected authors as “sour grapes” is intellectually dishonest.
reliabilityguy 5 hours ago [-]
> Framing this peer-reviewed research
How do you know it was peer-reviewed? What venue had accepted this paper for publication?
tough 5 hours ago [-]
how many papers do apple publish under their own CDN/domains
this was certainly a first for me when i saw it pop on hn the other day
It's easy to argue about the people who write the paper and their incentives. It takes a lot more effort to prove that the data, the procedure or the conclusion in the paper has flaws, and back it up.
android521 5 hours ago [-]
If they get paid by apple, they have a stake
emp17344 5 hours ago [-]
This kind of statement isn’t productive. Everyone has a bias. If you don’t believe the paper is valid, I’d like to hear your substantive critique.
JimDabell 5 hours ago [-]
This is ridiculous. If Apple wanted to make competing AI look bad, getting some researchers to publish a critical paper is hardly going to have any kind of worthwhile outcome.
smitty1e 6 hours ago [-]
> I found this comment to be relevant: "Keep in mind this whitepaper is really just Apple circling the wagons because they have dick for proprietary AI tech."
Now, if we fed the relevant references into an AI model, would the model offer this as a possible motive for the paper in question?
K0balt 6 hours ago [-]
Probably.