> they lack an explicit architecture for the executive control of attention found in humans
Deceptive terminology strikes again! The "attention" mechanism in transformers appears (to my understanding at least) to have about as much to do with human attention as the "neurons" in a multi-layer perceptron have to do with biological neurons.
That said, the core premise of building in something that mimics executive function is an intriguing one (which I assume has been explored before but it's not something I'm familiar with).
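For anyone who hasn't looked under the hood, here is a minimal sketch of what "attention" usually means in a transformer: scaled dot-product attention, i.e. a softmax-weighted average of value vectors. It is plain Python with no libraries, the function names are my own, and it leaves out everything a real model adds (learned projections, multiple heads, masking), so treat it as an illustration rather than anyone's actual implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each output row is a weighted average of the value vectors, where the
    weights come from how well the query matches each key.
    """
    d = len(keys[0])  # key dimension, used for the 1/sqrt(d) scaling
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

Nothing in there selects, inhibits, or sustains focus over time; it is a differentiable lookup, which is roughly why the analogy to human attention is loose.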
nico 14 hours ago [-]
Is this the AI version of ADHD?
LoganDark 14 hours ago [-]
My first thought too. AI-DHD?
quotemstr 18 hours ago [-]
The first thing I do when I see a paper that claims transformers fundamentally can't do X or Y is to look at the models under test:
> To evaluate generalizability, we conducted tests of GPT-5 (41), Claude Opus 4.1 (42), and Gemini 2.5 Pro (43) from 2025 September
The problem with empirical negative results on LLMs is that they can't rule out that the alleged deficiencies disappear with increased scale and the right fine-tuning. It's like saying my dog has trouble with subject-verb agreement, so meat brains are "fundamentally limited in their capacity for grammar".
I can accept that current LLMs (even latest generation) might exhibit cognitive gaps similar to those we see in humans with deficient executive function, but I can't accept these gaps as evidence of fundamental limits of the transformer architecture. LLMs are universal function approximators. Executive function is a function. Yes, yes, it's well-known that transformers have a circuit complexity limit set by layer count and whatever. The limit disappears once you allow for autoregression. Nobody cares about the limits of AI inside a single forward pass.
I have high confidence that with the right sort of training, executive function gaps in LLMs can be addressed. I'm not convinced that the problem is the architecture per se.
vlovich123 16 hours ago [-]
You’re just complaining they can’t prove a negative, which is literally impossible.
“I can accept fairies don’t exist today but that doesn’t mean fairies won’t exist in the future.”
The burden of proof lies with those claiming the transformer is able to do something like this. In fact, given that our brains don’t have anything resembling transformers, that they don’t learn anything like the way we train models, and that they have all sorts of integrated memory mechanisms that we only imitate with party tricks around vector databases, I think it’s safer to err on the side of assuming existing transformers fail in very specific ways that human brains generally do not. Also, we clearly haven’t really seen major architectural changes for transformers for a few years now. Most of it has been RL gains, not structural improvements. So it stands to reason that the deficiencies will remain even if we figure out ways to paper over them on a case by case basis.
quotemstr 15 hours ago [-]
Yes, I am complaining that they are making an impossibility claim on the basis of an observational gap. Such claims don't have a great track record in the history of science.
> negative, which is literally impossible.
Impossibility proofs are common in mathematics, physics, and computer science. This paper is not one of them. It reports an observational gap. That's not the same thing at all as showing, e.g., that any transformer, no matter how large or interconnected, can't compute some function.
> our brains
Airliners don't have feathers.
> we clearly haven’t really seen major architectural changes for transformers for a few years now.
Ever read a DeepSeek paper? Ever hear of MLA? Mamba? Or gated deltanet? Or RLMs? Universal transformers? There's been a deluge of architectural advancement over the past few years. You shouldn't go around asserting the burden of proof falls on this or that party if you're not familiar enough with the recent literature to recognize the kinds of proof that would satisfy this burden.
> deficiencies will remain even if we figure out ways to paper over it on a case by case basis.
I think there are general solutions unknown to us for classes of problem we solve one by one through brute force today. Not arguing that. I just don't accept that the path to generality goes through giving up "transformers", whatever this term means after the architectural Cambrian explosion of the past few years.
It's much more likely that further capability unlocks involve in-the-weights continuous online learning. How we do that is orthogonal to whether the weights encode a transformer, a diffusion model, an SSM, or something more exotic.
Sure, these things aren't pure transformers. But neither are frontier models. The industry is already doing what you suggest and moving beyond naive KQ dot product full depth everywhere 2010s-era transformers. Architectural innovation hasn't solved the problem. Turns out different architectures for approximating functions all form function approximators. The problem is in formulating the functions we want to approximate, not our spelling of the approximation engine!
vlovich123 15 hours ago [-]
> Ever read a DeepSeek paper? Ever hear of MLA? Mamba? Or gated deltanet? Or RLMs? Universal transformers?
Quite a few of those aren’t transformer architectures. MLA is more of a KV optimization that doesn’t degrade intelligence than something that directly improves intelligence. Indirectly it lets you run a larger model on the same hardware, but that’s it. It’s also 2 years old, while universal transformers are 8 years old, and only MLA has seen adoption. Your reply was a Gish gallop that, if anything, argues against there being anything really new in transformer capabilities where intelligence is concerned.
> Not arguing that. I just don't accept that the path to generality goes through giving up "transformers", whatever this term means after the architectural Cambrian explosion of the past few years.
I mean, dLLMs are quite architecturally different from the plain LLM transformers that you get from OpenAI or Anthropic today, even if they use transformers if you squint at them: they think bidirectionally and are embarrassingly parallel. Why would the next explosion not be architecturally different from the previous one? Indeed you’d expect a difference, because anything that can overcome today’s transformers has to be exponentially better, anything based around transformers won’t be, and there’s clearly still a few orders of magnitude between humans and LLMs.
> Sure, these things aren't pure transformers. But neither are frontier models. The industry is already doing what you suggest and moving beyond naive KQ dot product full depth everywhere 2010s-era transformers.
But they’re not, not really. The difference between Llama 3.2 and Claude Fabel architecturally is relatively small, with most of the gains coming from RL, training data, size, training systems, and inference loop infrastructure. It’s all clearly made a huge difference, but there haven’t been huge structural changes; most of the structural changes are around inference efficiency and trying to optimize performance without sacrificing intelligence. At some point you’ll run out of headroom for how far you can take that, and that point will be a long way from AGI.
zmgsabst 15 hours ago [-]
And chemistry could stop conserving energy tomorrow!
But what does the preponderance of observed evidence tell us is likely the case? For both conserved energy and transformer behavior.
You’re just outlining the problem of induction, but that doesn’t move the conversation forward, because the person clearly already understood that point and was (much like energy conservation) inductively proposing a rule.
derbOac 17 hours ago [-]
You might be completely correct, although my hunch is this is something that would require a change in architecture rather than increases in scale.
The failure points happen in a fairly simple task (Stroop) with increases in repetition of trials. It's not like the number of colors or color words is increasing, which is the sort of thing I might expect if it had to do with the size of the LLM.
On the other hand who knows. I agree that model scale changes make a lot of things a moving target.
At first I thought this paper was kind of odd, but then I felt like it was maybe possibly onto something important. Intuitively I could see the possibility that whatever is causing this failure in the Stroop task might be related to the tendency of LLMs to be "derailable".
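To make the "more repetitions, same colors" point concrete, here is a rough sketch of how one might generate Stroop trials where only the trial count grows. The paper's actual protocol isn't described in this thread, so the color set, the `congruent_ratio` parameter, and both function names are assumptions of mine, not the authors' setup.

```python
import random

# Fixed, small color vocabulary: this never grows with the number of trials.
COLORS = ["red", "green", "blue", "yellow"]


def make_stroop_trials(n_trials, congruent_ratio=0.5, seed=0):
    """Build n_trials Stroop items; the correct answer is the ink color."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        word = rng.choice(COLORS)
        if rng.random() < congruent_ratio:
            ink = word  # congruent: word and ink agree
        else:
            # incongruent: ink is any color other than the word
            ink = rng.choice([c for c in COLORS if c != word])
        trials.append({"word": word, "ink": ink, "answer": ink})
    return trials


def accuracy(trials, responses):
    """Fraction of responses that name the ink color correctly."""
    correct = sum(1 for t, r in zip(trials, responses) if r == t["answer"])
    return correct / len(trials)
```

Running `make_stroop_trials` with 10, 100, and 1000 trials and comparing `accuracy` across those lengths would isolate repetition as the variable, since the per-trial difficulty stays constant.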
roenxi 11 hours ago [-]
Aren't transformers universal function approximators? It seems pretty easy to see executive function as a simple computation. So it would be trivially true that a sufficiently large transformer could model executive function because it could approximate [current transformer] + [an approximation of the executive function algorithm] + [whatever bloat is needed to store state in a transformer].
It seems hard to come up with an argument that executive function can't possibly be approximated with an algorithm. Executive function is basic once the clustering into objects part of the process is done. The only real questions are whether a transformer of sufficient scale is feasible on current hardware and if the engineers with access to the hardware have figured out what to train for yet.
ivanvoid 18 hours ago [-]
this is a nice study but i don’t think it’s actually a good argument