> This is what most programmers do. They type raw text into the editor; the compiler either processes it into structured data, or returns an error the programmer has to internalise before resubmitting.
In the 1980s structural editors were quite popular (e.g. the BASIC editor on the ZX81). Using these, it is impossible for the programmer to create text that is not a valid program.
jnpnj 23 hours ago [-]
I never saw a structured editor on these machines. How did they operate? Grammar-guided insertion?
deskamess 20 hours ago [-]
Not a technical answer, but when we started up the system (ZX 16K) we were at a prompt. We would add commands with line numbers. After each line number, the possible commands were embossed on the keyboard and you would start with one of those (IF, PEEK, POKE, etc). What you could enter was limited by that. Edit: BASIC programming
> It might appear that this is an argument against scale, and the Bitter Lesson. That is not the case. I see this as a move that lets scale do its work on the right object. As with chess, where encoding the game rules into training produces a leap that no amount of inference-time search can today match, the move here is to encode the programming language itself into the training, and apply scale on a structure that actually reflects what we’re trying to produce.
One way to think of the bitter lesson as it applies to generative models is that ~all data carries some information about the structure of reality, and architectures that let you train on more data are better because they learn better underlying world models. Knowledge transfers: LLMs are good at writing code partly because they've seen a lot of code, but also because they understand (at least to some extent) the relationship between that code and the rest of the world. Constraining a model's output structure also constrains the data that is available to train it. So the big question is whether you can actually meaningfully scale training with these kinds of strictly structured outputs.
davebren 17 hours ago [-]
At the same time, treating everything as tokens and next-word prediction will never produce any real understanding like what humans do when they learn how to program. The bitter lesson is an admission that we still have no clue what is at the core of human learning and reasoning, so we have to brute-force it with tons of data generated by humans. I also don't know if expert systems and ML techniques like feature extraction are really any worse in practice, or if we just didn't have enough engineering resources or a proper way to organize and scale their development. They seemed to work quite well in a lot of cases, with more predictable results and several orders of magnitude less compute. And LLMs still suffer the long-tail problem despite their insane amounts of data.
If we're at the end of the data and most new data is now produced by LLMs with little human oversight, where do we go? Figuring out ways to mix LLMs with more structured models that can reliably handle important classes of problems seems like the next logical step. In a way that is what programming languages and frameworks/libraries are doing, but work on those has been massively disincentivized by the claim that LLMs will do everything.
The chess example is a good one: it's effectively solved, so why shouldn't an LLM have a submodule that it can use to play chess and save some energy?
bgavran 14 hours ago [-]
Author here - thanks for engaging.
> One way to think of the bitter lesson as it applies to generative models is that ~all data carries some information about the structure of reality
Completely agree. It might not have come across, but what I'm pointing out in the post is that the data as it is currently encoded in the models is needlessly lossy. Tokens do not reveal all the information we have at our disposal.
In natural language, that's fine, because it's quite loose in structure.
But if our domain is heavily structured (like modern programming languages are), why reveal only snippets of linearised syntax of that structure to the model? Why not reveal the full structure we have at our disposal?
> and architectures that let you train on more data are better because they learn better underlying world models.
By this argument, wouldn't we conclude that training on chess using the game structure wouldn't work either, since that'd be a model that uses less data?
> The reason I put off starting the series for so long is one of the same reasons blocking the writing of the paper: some of the introductory material is some of the most difficult to write. It has been such a long time that I no longer know how to adequately explain why the problem is so difficult.
My sympathies to Jules
pron 20 hours ago [-]
The problem is that the search space is so large that correcting errors via guardrails is only effective if the original error rate is low (how many Integer -> Integer functions are there? There's ~1 way to get it right and ~∞ ways to get it wrong).
Sure, we can help the easy cases, but that's because they're easy to begin with. In general, we know (or at least assume) that being able to check a solution tractably does not make finding the solution tractable, or we'd know that NP = P. So if LLMs could effectively generate a proof that they've found the correct Integer -> Integer function, either that capability will be very limited or we've broken some known or assumed computational complexity limit. As Philippe Schnoebelen discovered in 2002 [1], languages cannot reduce the difficulty of program construction or comprehension.
Of course, it is possible that machine learning could learn some class of problems previously unknown to be in P and find that it is in P, but we should understand that that is what it's done: realised that the problem was easy to begin with rather than finding a solution to a hard problem. This is valuable, but we know that hard problems that are of great interest do exist.

[1]: https://lsv.ens-paris-saclay.fr/Publis/PAPERS/PDF/Sch-aiml02...
>As Philippe Schnoebelen discovered in 2002 [1], languages cannot reduce the difficulty of program construction or comprehension.
From a model-checking point of view. This is about taking a proof-theoretic approach...
Your last paragraph is also quite wrong: a machine-learning model could very well learn and easily solve an NP-complete problem, because this property does not say anything about average-case complexity (and we should consider probabilistic complexity classes, so the picture is even more "complex").
pron 14 hours ago [-]
> From a model-checking point of view. This is about taking a proof-theoretic approach...
No. In complexity theory we deal with problems, and the model-checking problem is that of determining whether a program satisfies some property or not. If your logic is sound, you can certainly use an algorithm based on the logic's deductive theory (which could be type theory, but that's an unimportant detail) to decide the problem, but that can have no impact whatsoever on the complexity of the problem. The result applies to all decision procedures, be they model-theoretic or deductive (logic-theoretic).
> Your last paragraph is also quite wrong: a machine-learning model could very well learn and easily solve an NP-complete problem, because this property does not say anything about average-case complexity
No. First, it's unclear what "average complexity" means here, but for any reasonable definition, the "average complexity" of NP-hard problems is not known to be tractable. Second, complexity theory approaches this issue (of "some instances may be easier") using parameterised complexity [1], and I'm afraid that the results for the model-checking problem - which, again, is the inherent difficulty of knowing what a program does regardless of how you do it - are not very good. I mentioned such a result in an old blog post of mine here [2]. (Parameterised complexity is more applicable than probabilistic complexity here because even if there were some reasonable distribution of random instances, it's probably not the distribution we'd care about.)
There is no escape from complexity limits, and the best we can hope for is to find out that problems we're interested in have actually been easier than we thought all along. Of course, some people believe that the programs people actually write are somehow in a tractable complexity class that we've not been able to define - and maybe one day we'll discover that that's the case - but what we've seen so far suggests it isn't: if programs that people write are somehow easier to analyse, then we'd expect the size of programs we can soundly analyse to grow at the same pace as the size of programs people write, and nothing could be further from what we've observed. The size of programs that can be "proven correct" (especially using deductive methods!) has remained largely the same for decades, while the size of programs people write has grown considerably over that period of time.

[1]: https://en.wikipedia.org/wiki/Parameterized_complexity
[2]: https://pron.github.io/posts/correctness-and-complexity#corr...
In my mind the main problem here is setting up the environment for training the LLM and ensuring that there's enough high quality training data for consumption. Getting an environment set up for a single project is non-trivial - here I'm assuming that you want something similar to autocomplete in an IDE or language server integration. Even if you could set this up, are there enough projects to even train on in the first place?
Maybe this set-up will work for Haskell, but you can abandon any hope of setting up environments for C or C++. Even languages like Rust or C# may be impossible to train on, despite the build chain being a bit nicer than C or C++.
Daffrin 23 hours ago [-]
The connection between type systems and neural net structure is underexplored in practice. One thing I'd add: when you're dealing with multi-modal inputs in production — say, mixed structured and unstructured content — the type-safety problem compounds. You end up with implicit contracts at inference boundaries that are very hard to enforce.
Has the author written anything on how this applies to transformer architectures specifically? The attention mechanism seems like a place where a richer type theory would be genuinely useful.
I'm not sure what to make of TFA (I don't have time right now to investigate in detail, but the subject is interesting). It starts by saying you can stop generation as soon as you have an output that can't be completed -- and there are already more advanced techniques that do that. If your language is typed, then you can use a "proof tree with a hole" and check whether there's a possible completion of that tree.
References are "Type-Constrained Code Generation with Language Models" and "Statically Contextualizing Large Language Models with Typed Holes".
Then it switches to using an encoding that would be more semantic, but I think the argument is a bit flimsy: it compares chess to the plethora of languages that LLMs can spout somewhat correct code for (which is behind the success of this generally incorrect approach).
What I found more dubious is that it brushed off syntactical differences to say "yeah but they're all semantically equivalent". Which, it seems to me, is kind of the main problem about this; basically any proof is an equivalence of two things, but it can be arbitrarily complicated to see it. If we consider this problem solved, then we can get better things, sure...
I think without some e.g. Haskell PoC showing great results these methods will have a hard time getting traction.
Please correct any inaccuracies or incomprehension in this comment!
bgavran 24 hours ago [-]
Author here - thanks for engaging.
On existing techniques - Type-Constrained Generation paper is discussed in the blog post (under Constrained Decoding), and I'd group typed holes in the same bucket.
The problem with those methods is that they're inference-time: they don't update the weights. In this case, constrained decoding prevents the model from saying certain things without changing what the model wants to say. This becomes more problematic the more complex your type system gets, and that's before taking into account that for many of these systems type inference is undecidable.
Meaning: if I give you a starting string, in the presence of polymorphism and lambdas you might not always be able to tell whether it completes to a term of a particular type.
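For concreteness, the inference-time nature of constrained decoding can be sketched as below. This is a toy illustration, not any paper's actual implementation: `valid_next` is a hypothetical stand-in for a type checker's verdict, and the weights that produced `logits` are never touched.

```python
import math
import random

def constrained_sample(logits, valid_next, rng=random):
    """Sample one token after masking out options the checker rejects.

    logits: dict mapping token -> raw score from the (frozen) model
    valid_next: predicate saying which tokens the type checker allows
    """
    # Masking only removes probability mass; it does not change what
    # the underlying model "wants" to say.
    masked = {t: s for t, s in logits.items() if valid_next(t)}
    if not masked:
        raise ValueError("no type-correct continuation from this prefix")
    z = sum(math.exp(s) for s in masked.values())
    r, acc = rng.random(), 0.0
    for tok, s in masked.items():
        acc += math.exp(s) / z
        if acc >= r:
            return tok
    return tok  # guard against floating-point rounding
```

The renormalisation step redistributes the probability of forbidden tokens onto the permitted ones, which is exactly how a model that "wants" an ill-typed continuation can be silently forced down a low-probability path.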
On the syntactic difference: I'd gently reframe. The question isn't whether syntactically different programs are semantically equivalent, it's that regardless of which form you pick, the existing methods don't let the model learn the constructor choice.
That's what the next section is about.
woolion 22 hours ago [-]
Thank you for your reply. FTR, I find the subject very interesting and I hope there will be more work on this line of approach.
>The problem with those methods is that they're inference time
I agree, I just thought it was missing some prior art (not affiliated with these papers :-P)
What is not clear to me at all is, is this the draft of a research idea?
Or is there already some implementation coming in a later post?
It seems to me that such an idea would be workable for a given language with a given type system, but that there would be a black-magic step in training a model that works in a language-agnostic manner. Could you clarify?
bgavran 14 hours ago [-]
There is an existing implementation validating this idea, and the plan is to make it publicly available at some point.
> It seems to me that such an idea would be workable on a given language with a given type system, but it seems to me there would be a black magic step to train a model that would work in a language-agnostic manner.
That's correct. The blog post alludes to infrastructure building as a necessary component of making that happen, for that exact reason. I.e. while it's "easy" to generate a dependent pair in this way, generating an entire dependently typed AST is much more difficult. On the positive side, this is more of a software-engineering effort than a research one.
woolion 3 hours ago [-]
Ok, thank you for the info. Do you have any idea when "at some point" might be? I'd love to check it out.
big-chungus4 1 days ago [-]
So the model generates code; let's say it is wrongly typed. We then take the rightly typed version and use cross-entropy between them? Is that right? That just sounds like typical training, unless you can somehow take arbitrary code that the model generated and automatically find the rightly typed version, so you won't need a dataset for it.
yorwba 1 days ago [-]
Rather than letting the model generate arbitrary code and type-checking it afterward, the author wants to pre-restrict the output with templates that are well-typed by construction and only let the model make choices between valid alternatives in that restricted output space.
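A toy version of that restriction, under the assumption of a single-type grammar (hypothetical; the actual proposal works over full typed ASTs), might look like:

```python
import random

def gen(ty, rng, depth=3):
    """Generate an expression by choosing among templates that are
    well-typed by construction; no output can fail the type checker.

    A trained model would score these choices instead of rng.choice.
    """
    if ty != "Int":
        raise ValueError("toy grammar only has one type: Int")
    if depth == 0:
        return str(rng.randint(0, 9))  # leaves only: forces termination
    templates = [
        lambda: str(rng.randint(0, 9)),                               # literal
        lambda: f"({gen(ty, rng, depth - 1)} + {gen(ty, rng, depth - 1)})",
        lambda: f"({gen(ty, rng, depth - 1)} * {gen(ty, rng, depth - 1)})",
    ]
    return rng.choice(templates)()
```

Every string `gen` emits parses and evaluates, so the training signal can focus entirely on which choice to make, never on recovering from malformed output.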
ogogmad 20 hours ago [-]
"Even more, this coproduct can be thought of as an instance of a dependent pair indexed by a finite set" - I know what the individual terms mean, but I don't get what this is saying.
bgavran 16 hours ago [-]
A coproduct in the category Set is a disjoint union of sets, i.e. A + B + C where A, B, C are sets.
We can think of this coproduct as involving two choices:
1) a choice of which component of the coproduct we're interested in (first, second, or third)
2) a choice of an element of that component
That is, `A + B + C` is isomorphic to the dependent pair `(i : Fin 3) * D i`, where `Fin 3` is a set with three elements, `D : Fin 3 -> Type`, and `D(0) = A`, `D(1) = B`, `D(2) = C`.
Then, the idea is: why index by a finite set? If you replace `Fin 3` by some arbitrary set, you start to be able to model a very general notion of a dependent type.
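The same decomposition can be written out in Lean 4, using a hypothetical three-element index type in place of `Fin 3` (so the pattern matches reduce definitionally):

```lean
-- Hypothetical three-element index type, standing in for Fin 3
inductive Three | first | second | third

-- The indexing family: D picks out one component per index
def D (A B C : Type) : Three → Type
  | .first  => A
  | .second => B
  | .third  => C

-- A + B + C repackaged as a dependent pair (Σ i, D i):
-- first choose the component, then choose an element of it.
def pack {A B C : Type} : A ⊕ B ⊕ C → Σ i, D A B C i
  | .inl a        => ⟨.first, a⟩
  | .inr (.inl b) => ⟨.second, b⟩
  | .inr (.inr c) => ⟨.third, c⟩

def unpack {A B C : Type} : (Σ i, D A B C i) → A ⊕ B ⊕ C
  | ⟨.first, a⟩  => .inl a
  | ⟨.second, b⟩ => .inr (.inl b)
  | ⟨.third, c⟩  => .inr (.inr c)
```

Replacing `Three` by an arbitrary index type is exactly the generalisation to dependent types mentioned above.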
That was not a very good description so try this: https://www.usebox.net/jjm/notes/basic/ (scroll down for keyboard pic and also some code). Or this video : https://youtu.be/zgjGsNS6a0Y?t=167
https://www.zx81stuff.org.uk/zx81/jtyone.html
> By this argument, wouldn't we conclude that training on chess using the game structure wouldn't work either, since that'd be a model that uses less data?

Less data is the point, isn't it?

https://cybercat.institute/2025/05/07/neural-alchemy/
https://cybercat.institute/2026/02/20/categorical-semantics-...
https://cybercat.institute/2025/10/16/dependent-optics-ii/
> Has the author written anything on how this applies to transformer architectures specifically?

I've implemented these in Idris 2: https://github.com/bgavran/TensorType/blob/main/src/NN/Archi...