Really fascinating how this works; it's basically context-aware decoding. From the paper:
> Code interleaves fork positions, where several continuations are genuinely plausible and may correspond to different solution approaches, with lock positions, where syntax and semantics leave little ambiguity but a low-probability distractor tail still remains… The best global decoding setting is therefore necessarily a compromise; we call this tension the precision-exploration conflict.
In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).
What this paper shows is that their simple technique (SSD) can improve the ranking of optimal tokens in both lock and fork positions, meaning the model is more likely to explore when it should be exploring, and more likely to be precise when it needs to be.
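A toy illustration of that tension, not the paper's method: the two distributions below are made-up numbers standing in for a "lock" position (one dominant token plus a distractor tail) and a "fork" position (several plausible tokens). The sketch only shows how a single global temperature pulls the two in opposite directions.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution the way softmax(log p / T) would."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def entropy(probs):
    """Shannon entropy in nats; higher means more diverse sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token distributions (assumed numbers, not from the paper).
lock = [0.90, 0.04, 0.03, 0.03]   # one right answer plus a distractor tail
fork = [0.30, 0.25, 0.25, 0.20]   # several genuinely plausible continuations

for t in (0.5, 1.0, 1.5):
    lock_t = apply_temperature(lock, t)
    fork_t = apply_temperature(fork, t)
    print(f"T={t}: lock tail mass={1 - lock_t[0]:.3f}, "
          f"fork entropy={entropy(fork_t):.3f}")
```

Lowering the temperature shrinks the distractor tail at the lock but also flattens out the diversity at the fork; raising it does the reverse. That is the compromise a single global setting is stuck with.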
I love that we're still learning the emergent properties of LLMs!
user_7832 1 days ago [-]
> I love that we're still learning the emergent properties of LLMs!
TBH, this is (very much my opinion btw) the least surprising thing. LLMs (and especially their emergent properties) are still black boxes. Humans have been studying the human brain for millennia, and we are barely better at predicting how humans work (or, e.g., to what extent free will is a thing). Hell, the emergent properties of traffic were not understood or given proper attention, even when a researcher, as a driver, knows what a driver does. Right now, on the front page, is this post:
> 14. Claude Code Found a Linux Vulnerability Hidden for 23 Years (mtlynch.io)
So it's pretty cool we're learning new things about LLMs, sure, but it's barely surprising that we're still learning it.
(Sorry, mini grumpy man rant over. I just wish we knew more of the world but I know that's not realistic.)
AlphaAndOmega0 1 days ago [-]
I'm a psychiatry resident who finds LLM research fascinating because of how strongly it reminds me of our efforts to understand the human brain/mind.
I dare say that in some ways, we understand LLMs better than humans, or at least the interpretability tools are now superior. Awkward place to be, but an interesting one.
p1esk 1 days ago [-]
LLMs are orders of magnitude simpler than brains, and we literally designed them from scratch. Also, we have full control over their operation and we can trace every signal.
Are you surprised we understand them better than brains?
danielmarkbruce 1 days ago [-]
"Designed" is a bit strong. We "literally" couldn't design programs to do the interesting things LLMs can do. So we gave a giant for loop a bunch of data and a bunch of parameterized math functions and just kept updating the parameters until we got something we liked.... even on the architecture (ie, what math functions) people are just trying stuff and seeing if it works.
> We "literally" couldn't design programs to do the interesting things LLMs can do.
That's a bit of an overstatement.
The entire field of ML is aimed at problems where deterministic code would work just fine, but the amount of cases it would need to cover is too large to be practical (note, this has nothing to do with the impossibility of its design) AND there's a sufficient corpus of data that allows plausible enough models to be trained. So we accept the occasionally questionable precision of ML models over the huge time and money costs of engineering these kinds of systems the traditional way. LLMs are no different.
jeremyjh 5 hours ago [-]
It is impossible to design even in a theoretical sense if functional requirements consider matters such as performance and energy consumption. If you have to write petabytes of code you also have to store and execute it.
danielmarkbruce 1 days ago [-]
Saying ML is a field where deterministic code would work just fine conveniently leaves out the difficult part - writing the actual code.... Which we haven't been able to do for most of the tasks at hand.
What you are saying is fantasy nonsense.
astrange 23 hours ago [-]
They did not leave it out.
> but the amount of cases it would need to cover is too large to be practical (note, this has nothing to do with the impossibility of its design)
yunnpp 1 days ago [-]
> would work just fine, but the amount of cases it would need to cover is too large to be practical
So it doesn't work.
idiotsecant 24 hours ago [-]
And all you have to do is write an infinite amount of code to cover all possible permutations of reality! No big deal, really.
growpdifjkl 24 hours ago [-]
[flagged]
AlphaAndOmega0 23 hours ago [-]
I'm a psychiatry resident who has been into ML since... at least 2017. I even contemplated leaving medicine for it in 2022 and studied for that, before realizing that I'd never become employable (because I could already tell the models were getting faster than I am).
You would be sorely mistaken to think I'm utterly uninformed about LLM-research, even if I would never dare to claim to be a domain expert.
jeremyjh 1 days ago [-]
We've been studying brains a lot longer. LLMs are grown, not built. The part that is designed is the low-level architecture - but what it builds from that is incomprehensible and unplanned.
da_chicken 15 hours ago [-]
It's not that much longer, really.
LLMs draw their origins from both n-gram language models (ca. 1990s) and neural networks and deep learning (ca. 2000). So we've only had really good ones maybe 6-8 years or so, but the roots of the study go back 30 years at least.
Psychiatry, psychology, and neurology on the other hand, are really only roughly 150 years old. Before that, there wasn't enough information about the human body to be able to study it, let alone the resources or biochemical knowledge necessary to be able to understand it or do much of anything with it.
So, sure, we've studied it longer. But only 5 times longer. And, I mean, we've studied language, geometry, and reasoning for literally thousands of years. Markov chains are like 120 years old, so older than computer science, and you need those to make an LLM.
And if you think we went down some dead-end directions with language models in the last 30 years, boy, have I got some bad news for you about how badly we botched psychiatry, psychology, and neurology!
jeremyjh 5 hours ago [-]
You are still talking about low level infrastructure. This is like studying neurons only from a cellular biology perspective and then trying to understand language acquisition in children. It is very clear from recent literature that the emergent structure and behavior of LLMs is absolutely a new research field.
hellrich 8 hours ago [-]
Embedding „meaning“ in vector spaces goes back to 1950s structuralist linguistics and early information retrieval research, there is a nice overview in the draft for the 3rd edition of speech and language processing https://web.stanford.edu/~jurafsky/slp3/5.pdf
ctoth 1 days ago [-]
> Also, we have full control over their operation and we can trace every signal. Are you surprised we understand them better than brains?
Very, monsieur Laplace.
evilduck 1 days ago [-]
To be fair to your field, that advancement seems expected, no? We can do things to LLMs that we can't ethically or practically do to humans.
AlphaAndOmega0 23 hours ago [-]
I'm still impressed by the progress in interpretability. I remember being quite pessimistic that we'd achieve even what we have today (and I recall that being the consensus among ML researchers at the time). In other words, while capabilities have advanced at about the pace I expected from the GPT-2/3 days, mechanistic interpretability has advanced even faster than I'd hoped for (though in some ways we are still very far from completely understanding the ways LLMs work).
bensyverson 1 days ago [-]
Learning about the emergent properties of these black boxes is not surprising, but it's also not daily. I think every new insight is worth celebrating.
user_7832 1 days ago [-]
Oh I very much agree that it's great to see more research and findings and improvements in this field. I'm just a little puzzled by GP's tone (which suggested that it isn't completely expected to find new things about LLMs, a few years in).
bensyverson 1 days ago [-]
I'm the GP! lol… Not sure how you got that from my tone, but I find these discoveries expected but not routine, and also interesting.
TeMPOraL 1 days ago [-]
Indeed. For me, it's also a good reminder that AI is here to stay as technology, that the hype and investment bubble don't actually matter (well, except to those that care about AI as investment vehicle, of which I'm not one). Even if all funding dried out today, even if all AI companies shut down tomorrow, and there are no more models being trained - we've barely begun exploring how to properly use the ones we have.
We have tons of low-hanging fruits across all fields of science and engineering to be picked, in form of different ways to apply and chain the models we have, different ways to interact with them, etc. - enough to fuel a good decade of continued progress in everything.
bathtub365 1 days ago [-]
AI has been here to stay for decades
TeMPOraL 1 days ago [-]
Maybe, but you couldn't tell that these days, casually scrolling this or any other tech-oriented discussion board.
ethin 1 days ago [-]
I mean... You could? AI comes in all kinds of forms. It's been around practically since Eliza. What is (not) here to stay are the techbros who think every problem can be solved with LLMs. I imagine that once the bubble bursts and the LLM hype is gone, AI will go back to exactly what it was before ChatGPT came along. After all, IMO it's quite true that the AIs nobody talks about are the AIs that are actually doing good or interesting things. All of those AIs have been pushed to the backseat because LLMs have taken the driver and passenger seats, but the AIs working on cures for cancer (assuming we don't already have said cure and it just isn't profitable enough to talk about/market) for example are still being advanced.
darkwater 1 days ago [-]
Saying that LLMs will disappear once the financial hype deflates is like saying that LLMs are the answer to everything.
59nadir 14 hours ago [-]
Personally I read the GP post with more emphasis on this bit:
> What is (not) here to stay are the techbros who think every problem can be solved with LLMs.
LLMs are in all likelihood here to stay, but the scumbags doing business around them right now are hopefully going away eventually.
darkwater 14 hours ago [-]
I agree on that part as well, but saying that AI will go back to what it was before ChatGPT came along is false. LLMs will still be a standalone product and will be taken for granted. People will (maybe? hopefully?) eventually learn to use them properly and not generate tons of slop for the sake of using AI. Many "AI companies" will disappear from the face of the Earth. But our reality has changed.
TeMPOraL 12 hours ago [-]
LLMs will not be just a standalone product. The models will continue to get embedded deep into software stacks, as they're already being today. For example, if you're using a relatively modern smartphone, you have a bunch of transformer models powering local inference for things like image recognition and classification, segmentation, autocomplete, typing suggestions, search suggestions, etc. If you're using Firefox and opted into it, you have local models used to e.g. summarize contents of a page when you long-click on a link. Etc.
LLMs are "little people on a chip", a new kind of component, capable of general problem-solving. They can be tuned and trimmed to specialize in specific classes of problems, at great reduction of size and compute requirements. The big models will be around as part of user interface, but small models are going to be increasingly showing up everywhere in computational paths, as we test out and try new use cases. There's so many low-hanging fruits to pick, we're still going to be seeing massive transformations in our computing experience, even if new model R&D stalled today.
amelius 1 days ago [-]
Studies of LLMs belong in their own field of science, just like psychology is not being studied in the physics department.
guelo 24 hours ago [-]
That field is called Machine Learning.
amelius 9 hours ago [-]
No that's still like putting cellular biology and psychology in the same bin.
osigurdson 1 days ago [-]
That is a very interesting thought!
littlestymaar 1 days ago [-]
Interestingly enough, for a while physics used to be studied by philosophers (and used to be put in the natural philosophy basket, together with biology and most other hard sciences).
zer00eyz 1 days ago [-]
The intersection of physics isn't psychology, it is philosophy, and the same is true (at present) with LLMs.
Much as Diogenes mocked Plato's definition of a man with a plucked chicken, LLMs revealed what "real" AI would require: contigous learning. That isn't to diminish the power of LLMs (they are useful), but that limitation is a fairly hard one to overcome if true AGI is your goal.
andai 1 days ago [-]
Is it because we haven't invented something better than backpropagation yet?
From what I understand, a living neural network learns several orders of magnitude more efficiently than an artificial one.
I'm not sure where that difference comes from. But my brain probably isn't doing back propagation, it's probably doing something very different.
astrange 23 hours ago [-]
Your brain is doing several different things, because there are different parts of your brain.
(eg different kinds of learning for long-term memory, short-term memory, languages, faces and reflexes.)
quantummagic 1 days ago [-]
What is "contigous" learning, and why is it a hard requirement of AGI?
amelius 1 days ago [-]
What do you mean by the intersection of physics?
The intersection of what with physics?
zer00eyz 1 days ago [-]
The intersection of disciplines.
Sir Roger Penrose, on quantum consciousness (and there is some regret on his part here) -- OR -- Jacob Barandes for a much more current thinking on this sort of intersectional exploratory thinking.
elbear 10 hours ago [-]
I thought it was determined (slight pun) that free will is not a thing. I'm referring to Sapolsky's book "Determined: A Science of Life Without Free Will" as an example.
Invictus0 1 days ago [-]
To say we've been studying the brain for millennia is an extreme exaggeration. Modern neuroscience is only about 50 years old.
user_7832 1 days ago [-]
I hate to "umm, akshually" but apparently we have been studying the brain for thousands of years. I wasn't talking about purely modern neuroscience (which, ironically for our topic of emergence, often until recently, and still in most places, treats the brain as the sum of its parts - be they neurons or neurotransmitters).
> The earliest reference to the brain occurs in the Edwin Smith Surgical Papyrus, written in the 17th century BC.
I was actually thinking of ancient greeks when writing my comment, but I suppose Egyptians have even older records than them.
None of that counts as studying the brain. It's like saying rubbing sticks together to make fire counts as studying atomic energy. Those early "researchers" were hopelessly far away from even the most tangential understanding of the workings of the brain.
timcobb 1 days ago [-]
I came here to say this :)
vova_hn2 1 days ago [-]
I've always thought that it is kinda weird that we spend exactly the same amount of compute to calculate both "fork" tokens and "lock" tokens.
I think that with grammar-aware sampling / constrained decoding [0][1] it is possible to sometimes skip calling the model altogether if only one token is allowed by grammar and just insert it, but I don't think that any of the current, widely used combinations of models/harnesses use it. And it only skips inference in rare edge cases.
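A minimal sketch of that shortcut, under loud assumptions: `allowed_next` stands in for a grammar engine and `pick_token` for an actual model call. Both are hypothetical callables invented for this example, not the API of any real constrained-decoding library.

```python
from typing import Callable, List, Set, Tuple

def constrained_decode(
    allowed_next: Callable[[List[str]], Set[str]],     # hypothetical grammar oracle
    pick_token: Callable[[List[str], Set[str]], str],  # hypothetical model call
    max_tokens: int = 32,
) -> Tuple[List[str], int]:
    """Decode under a grammar, skipping the model when the grammar forces the token."""
    tokens: List[str] = []
    model_calls = 0
    for _ in range(max_tokens):
        allowed = allowed_next(tokens)
        if not allowed:              # grammar says the output is complete
            break
        if len(allowed) == 1:        # forced token: insert it, no inference needed
            tokens.append(next(iter(allowed)))
        else:                        # real choice: this is where compute is spent
            tokens.append(pick_token(tokens, allowed))
            model_calls += 1
    return tokens, model_calls

# Toy grammar for statements of the form `x = <digit> ;` (an assumption for the demo).
def toy_grammar(prefix: List[str]) -> Set[str]:
    slots = [{"x"}, {"="}, {"0", "1", "2"}, {";"}]
    return slots[len(prefix)] if len(prefix) < len(slots) else set()

# Stand-in for a model: picks the alphabetically last allowed token.
def toy_model(prefix: List[str], allowed: Set[str]) -> str:
    return sorted(allowed)[-1]

tokens, calls = constrained_decode(toy_grammar, toy_model)
print(tokens, "model calls:", calls)
```

In this toy grammar three of the four tokens are forced, which is far more generous than a real programming-language grammar would be; that matches the point that the trick only pays off in rare cases.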
I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.
Give coding agents access to intellisense and syntax highlighting.
Making coding agents spit out syntactically correct code token by token is like asking a human to code on a whiteboard.
vova_hn2 1 days ago [-]
Yeah, I was also thinking about it A LOT.
We kinda have a little bit of it with some coding harnesses giving model access to LSP, but I think that we can insert this knowledge on a lower level if we find a clever way to somehow utilize it during sampling.
I think that there is a lot of low hanging fruit in this area.
And in general, I think that people try to use LLMs too much to solve problems that can be easily solved by cheaper (computationally), and, more importantly deterministic tools.
For example, back in the day when LLM-assisted coding just became a thing people very often complained about models generating syntactically incorrect code and inventing non-existent library methods.
Well, I, an experienced human programmer, probably would also be making syntax mistakes and inventing non-existent methods if you stripped me of my tools and made me write code in a bare text editor without syntax highlighting.
Thankfully, my IDE would autocomplete real syntax and actually existing library methods for me and immediately give me feedback if I make a mistake anyway. And all of it is achieved using reliable deterministic code without the inherent issues of statistical models.
I think that it is really inefficient to reach for an expensive and unreliable tool when a cheap and reliable tool will do.
jwolfe 1 days ago [-]
In general these agents support LSPs, which is often as much information as your IDE will give you. They are also not required to output syntactically correct code token by token when running agentically, because the loop is:
1. code
2. syntax check / build / format / lint (details language dependent)
3. test
and they can hop between 1 and 2 however many times they want.
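Roughly, that loop could be sketched like this. `write_code` and `run_tests` are hypothetical callables standing in for the model and the project's test runner, and Python's own `ast.parse` plays the role of the syntax check; real agents wire this up in their own ways.

```python
import ast
from typing import Callable, Optional

def syntax_error(source: str) -> Optional[str]:
    """Return a short error message, or None if the source parses."""
    try:
        ast.parse(source)
        return None
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"

def agent_loop(
    write_code: Callable[[Optional[str]], str],  # hypothetical: model writes code from feedback
    run_tests: Callable[[str], bool],            # hypothetical: project test runner
    max_rounds: int = 5,
) -> Optional[str]:
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        source = write_code(feedback)            # step 1: code
        feedback = syntax_error(source)          # step 2: syntax check
        if feedback is not None:
            continue                             # hop back to step 1 with the error
        if run_tests(source):                    # step 3: test
            return source
        feedback = "tests failed"
    return None

# Scripted stand-in for a model: first attempt is broken, second one parses.
drafts = iter(["def f(:\n    pass", "def f():\n    return 1"])
result = agent_loop(lambda feedback: next(drafts), lambda source: True)
print(result)
```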
tadfisher 1 days ago [-]
Doing a tool call for autocomplete is not going to make coding agents faster.
I do think there is some merit in a tool that dumps all namespaces and reachable symbols so the agent can do its own autocomplete without a round-trip.
jameshart 23 hours ago [-]
Doesn’t need to be a tool call.
As a human coder you don’t summon intellisense. It’s just popped up into your visual field as extra input - contextual cues.
You could force intellisense state into the context vector the LLM receives.
foota 19 hours ago [-]
Not really, because the LLM loop doesn't have the ability to get updates from the agent live. It would have to somehow be integrated all the way down the stack.
jameshart 19 hours ago [-]
LLMs can have whatever abilities we build for them. The fact we currently start their context out with a static prompt which we keep feeding in on every iteration of the token prediction loop is a choice. We don’t have to keep doing that if there are other options available.
orbital-decay 12 hours ago [-]
You're describing structured outputs.
sgbeal 1 days ago [-]
> Give coding agents access to intellisense and syntax highlighting.
i once asked an LLM if it could ingest code from an interactive session more easily if it were in appropriately-typed markdown fences and it said absolutely yes, and that the syntax highlighting fed to it that way helps it immensely. i was downright shocked that syntax highlighting was anything more than noise for them.
devmor 1 days ago [-]
Why would this be surprising? That’s exactly how much of the code they were trained on is presented in PRs, Forums, etc.
astrange 23 hours ago [-]
Is that true? That depends on how their web scraping works, like whether it runs client-side highlighting, strips out HTML tags, etc.
devmor 18 hours ago [-]
The highlighting isn't what matters, it's the pretext. E.g. an LLM seeing "```python" before a code block is going to better recall python codeblocks by people that prefixed them that way.
olejorgenb 22 hours ago [-]
> I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.
I think speculative decoding counts as a (perhaps crude) way of implementing this?
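For anyone who hasn't seen it, here is a toy sketch of the greedy flavour of speculative decoding. `target_next` and `draft_next` are hypothetical stand-ins for a big and a small model, and `target_rounds` counts verification rounds, each of which a real system would run as one batched forward pass of the big model. Real implementations also handle sampling and a bonus token, which this leaves out.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]  # hypothetical: greedy next token for a prefix

def speculative_greedy(
    target_next: NextToken,   # expensive model (assumed)
    draft_next: NextToken,    # cheap model (assumed)
    num_tokens: int,
    k: int = 4,
) -> Tuple[List[str], int]:
    out: List[str] = []
    target_rounds = 0
    while len(out) < num_tokens:
        # The cheap model guesses k tokens ahead.
        proposal: List[str] = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # The expensive model checks the guesses; conceptually one batched pass.
        target_rounds += 1
        accepted: List[str] = []
        for token in proposal:
            expected = target_next(out + accepted)
            if token == expected:
                accepted.append(token)       # "obvious" token, guessed cheaply
            else:
                accepted.append(expected)    # disagreement: take the target's token
                break
        out.extend(accepted)
    return out[:num_tokens], target_rounds

# Toy models: the draft agrees with the target except at every fifth position.
def toy_target(prefix: List[str]) -> str:
    return "abcd"[len(prefix) % 4]

def toy_draft(prefix: List[str]) -> str:
    return "x" if len(prefix) % 5 == 0 else toy_target(prefix)

tokens, rounds = speculative_greedy(toy_target, toy_draft, num_tokens=12)
print("".join(tokens), "target rounds:", rounds)
```

The output matches what the big model would have produced on its own; the saving is that easy stretches get accepted several tokens per round, while hard positions still cost a full round each.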
quotemstr 1 days ago [-]
> I wonder if there is a more general solution that can make models spend more compute on making important choices
There's a lot of work going on in various streams towards making it possible to vary compute per-token, dynamically, e.g. universal transformers. Maybe one day it'll work well enough to beat conventional techniques.
khalic 1 days ago [-]
Another example of the mindf@#$ these systems are: I was fine-tuning a small model to take data fields and make a sentence out of them. I was running into mode collapse (basically when the AI simplifies too much and always outputs the same thing).
I got unstuck by randomizing the field order for each row?!? That was at training time, and now I'm thinking I should do the same at inference time...
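For the curious, the shuffling is about this much code. The field names and the `key: value` serialization are made up for the example; the real data looked different.

```python
import random
from typing import Dict

def render_fields(row: Dict[str, str], rng: random.Random) -> str:
    """Serialize one row with its fields in a random order."""
    items = list(row.items())
    rng.shuffle(items)  # new order on every call, so the model can't latch onto position
    return "; ".join(f"{key}: {value}" for key, value in items)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
row = {"name": "Ada", "city": "London", "role": "engineer"}  # hypothetical fields
for _ in range(3):
    print(render_fields(row, rng))
```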
p_stuart82 1 days ago [-]
the irony of modern software engineering: we spent decades perfecting deterministic algorithms, and now we're basically just shaking a black box and hoping the magic rocks align.
darkhorse222 23 hours ago [-]
Quantum physics teaches us that at the fundamental levels of physics, reality itself is probabilistic. Probability distributions collapsing to discrete locations aligns nicely across LLMs and quantum mechanics.
khalic 1 days ago [-]
It's a little disturbing, but also very fun to just discover by probing, building and breaking.
astrange 23 hours ago [-]
This is an AI bot btw. (sarcasm, metaphor that doesn't make sense)
khalic 22 hours ago [-]
Me or the new account?
astrange 21 hours ago [-]
Not you!
khalic 10 hours ago [-]
oh good, I never know if my metaphors make sense :D
auspiv 1 days ago [-]
apparently you can straight up duplicate/add/rearrange layers without changing any of the weights and get better results as well - https://dnhkng.github.io/posts/rys/
khalic 1 days ago [-]
This is crazy, thank you for the link!
quotemstr 1 days ago [-]
Neat!
> This is probably due to the way larger numbers are tokenised, as big numbers can be split up into arbitrary forms. Take the integer 123456789. A BPE tokenizer (e.g., GPT-style) might split it like: ‘123’ ‘456’ ‘789’ or: ‘12’ ‘345’ ‘67’ ‘89’
xVal basically says "tokenizing numbers is hard: what if instead of outputting tokens that combine to represent numbers, we just output the numbers themselves, right there in the output embedding?"
It works! Imagine you're discussing math with someone. Instead of saying "x is twenty five, which is large" in words, you'd say "x is", then switch to making a whistling noise in which the pitch of your whistle, in its position within your output frequency range, communicated the concept of 25.00 +/- epsilon. Then you'd resume speech and say "which is large".
I think the sentiment is that today's models are big and well-trained enough that receiving and delivering quantities as tokens representing numbers doesn't hurt capabilities much, but I'm still fascinated by xVal's much more elegant approach.
khalic 24 hours ago [-]
I was having some issues with IP addresses representation, this might solve it
toddmorey 1 days ago [-]
wow that's fascinating
sinuhe69 6 hours ago [-]
“In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).”
I’d be very cautious of the phrase 'just like us'. Not only can anthropomorphism be misleading and make us see things where none exist, it can also befuddle us, especially when we don’t know much about ourselves.
stingraycharles 1 days ago [-]
Seems like this is true for not just code but for all content being generated? Albeit for code it’s more well-defined, but the fork / lock mechanism works for a lot more problem domains.
bensyverson 1 days ago [-]
That would seem intuitively true; it certainly applies to written language, where a clause could go off in another direction, but at other positions the correct grammar/syntax is unambiguous.
bryanrasmussen 1 days ago [-]
thinking -
well if we think of lock as happening in a narrative, then I think we can see there can be points where "everything you know is wrong" which essentially allows you to go back into a sort of fork mode and work towards another lock.
Completely artistic creation, creating something that does not exist and that cannot produce things out of itself, means that locking can be more diffuse, not as settled.
stingraycharles 1 days ago [-]
I think this seems similar to what Anthropic had been doing since the latest few Opus releases, which is interleaved thinking; CoT reasoning in the middle of a message. But they operate at different layers.
orbital-decay 1 days ago [-]
One relevant thing is that these forks are unnaturally narrow in all models, and rather resemble locks (not quite but close). From multiple possible continuations models tend to prefer just a couple, i.e. the model is a lot less random than it should be. That's why you're seeing annoying slop in writing and instantly recognizable color schemes in vibecoded sites. Lack of diversity probably limits the usefulness of this method as well.
>I love that we're still learning the emergent properties of LLMs!
There are tons of low-hanging fruits there.
p_stuart82 1 days ago [-]
it feels like the modern recurrence of the early 2010s bootstrap templates. we figured out how to automate building sites instantly, but at the cost of making the entire web look exactly the same.
Could we not get the same with EAFT? Maybe that’s what it’s doing but definitely not the first to think “let’s lock in high probability solutions”
In nemotron the high perplexity solutions are selected for RL, in VLM training a few people are looking at the entropy distributions of the training set, etc
sdwr 21 hours ago [-]
What's cool is that they aren't adjusting the temperature of the model live, or predicting/labeling any of the fork/lock points.
robocat 23 hours ago [-]
> In other words, just like us
I think you are implying a reverse causation. They used a metaphor from us.
michaelbuckbee 1 days ago [-]
I don't really understand the internal mechanics of this, but my first thought was why not combine this with a linter/tests. So that it produces all the forks and only keeps the syntactically correct ones.
mrtesthah 1 days ago [-]
That’s going to be inefficient when most of the generations have broken syntax and can’t even parse.
TacticalCoder 1 days ago [-]
> What this paper shows is that their simple technique (SSD)
"Simple Self-Distillation". We had an acronym for Solid-State Drive. Don't know about that technique but the naming sure sounds.. Simple?
wg0 1 days ago [-]
After TurboQuant and Gemma 4, I came across the following video[0] running Gemma on a local machine at 50 tokens/second.
That already looks like Sonnet 3x and 4 level capabilities to me, where the model in question (Gemma 4) sets up a whole Python project with a UI and installs Python libraries using uv etc.
Add this Simple Self-Distillation to the picture and by 2028 I see cheaper coding model providers with much more generous usage limits, and power users would be mostly running their own models anyway.
Anyone using these models as "non-deterministic transpilers" from natural language to code (experienced engineers who can write code themselves) would probably not be paying any AI providers.
I always wonder how much smaller and faster models could be if they were only trained on the latest versions of the languages I use, so for me that is PHP, SQL, HTML, JS, CSS, Dutch, English, plus tool use for my OS of choice (MacOS).
Right now it feels like hammering a house onto a nail instead of the other way around.
ACCount37 1 days ago [-]
Not very. LLMs derive a lot of their capability profile from the sheer scale.
LLMs have something that's not entirely unlike the "g factor" in humans - a broad "capability base" that spans domains. The best of the best "coding LLMs" need both good "in-domain training" for coding specifically and a high "capability base". And a lot of where that "base" comes from is: model size and the scale of data and compute used in pre-training.
Reducing the model scale and pruning the training data would result in a model with a lower "base". It would also hurt in-domain performance - because capabilities generalize and transfer, and pruning C code from the training data would "unteach" the model things that also apply to code in PHP.
Thus, the pursuit of "narrow specialist LLMs" is misguided, as a rule.
Unless you have a well defined set bar that, once cleared, makes the task solved, and there is no risk of scope adjustment, no benefit from any future capability improvements above that bar, and enough load to justify the engineering costs of training a purpose-specific model? A "strong generalist" LLM is typically a better bet than a "narrow specialist".
In practice, this is an incredibly rare set of conditions to be met.
weitendorf 1 days ago [-]
It's more complicated than that. Small specialized LLMs are IMO better framed as "talking tools" than generalized intelligence. With that in mind, it's clear why something that can eg look at an image and describe things about it or accurately predict weather, then converse about it, is valuable.
There are hardware-based limitations in the size of LLMs you can feasibly train and serve, which imposes a limit in the amount of information you can pack into a single model's weights, and the amount of compute per second you can get out of that model at inference-time.
My company has been working on this specifically because even now most researchers don't seem to really understand that this is just as much an economics and knowledge problem (cf Hayek) as it is "intelligence"
It is much more efficient to strategically delegate specialized tasks, or ones that require a lot of tokens but not a lot of intelligence, to models that can be served more cheaply. This is one of the things that Claude Code does very well. It's also the basis for MOE and some similar architectures with a smarter router model serving as a common base between the experts.
BarryMilo 1 days ago [-]
I seem to remember that's one of the first things they tried, but the general models tended to win out. Turns out there's more to learn from all code/discussions than from just JS.
justinlivi 23 hours ago [-]
From my own empirical research, the generalized models acting as specialists outperform both the tiny models acting as specialists and the generalist models acting as generalists. It seems that if peak performance is what you're after, then having a broad model act as several specialized models is the most impactful.
Someone1234 1 days ago [-]
Wouldn't that mean they're bad at migration tasks? I feel like for most languages, going from [old] to [current] is a fairly to very common usage scenario.
rixed 13 hours ago [-]
The analogy with human brains suggests that it would not end very well.
nareyko 1 days ago [-]
[dead]
red75prime 1 days ago [-]
> power users would be mostly running their own models
...with a fair amount of supervision, while frontier models would be running circles around them using project-specific memory and on-demand training (or whatever we would have by then).
3abiton 1 days ago [-]
Honestly right now it's mainly stagnation in frontier model capabilities. Most of the recent advancements are towards generation speed, compression and tool usage. The quality of the models is not improving at the same rate as before. I doubt this big gap will continue, given that open source and especially Chinese labs keep pushing well-documented frontier papers.
darkerside 1 days ago [-]
Those will be great for projects that look just like everybody else's. That's not a knock. We'll see plenty of new systems built by anyone who needs one.
If you're building something groundbreaking and new, the advantage will be slim to none.
littlestymaar 1 days ago [-]
If what you refer to by “on-demand training” is fine-tuning, it's going to be much more efficient on a small model than a big one.
red75prime 1 days ago [-]
LoRA can work with big models. But I mean sample-efficient RL.
teleforce 20 hours ago [-]
It seems that self-distillation is the way to go for LLM.
Self-distillation was recently shown to be very efficient and effective, back in January this year, by an MIT and ETH team in their Self-Distillation Fine-Tuning (SDFT) LLM system [1],[2].
That work also appears as the closest competitor, named On-Policy Self-Distillation, in this paper's comparison table.
I hope they keep the original work's real name, Self-Distillation Fine-Tuning or SDFT. Imagine a later paper citing this very paper as cross-entropy self-distillation instead of its own given name, Simple Self-Distillation or SSD. Although I'll admit it's a lousy name that collides with the common SSD nomenclature for solid-state drive, as others have rightly pointed out.
I think they should have given proper credit to this earlier seminal work on SDFT, but apparently they just list it as one of the systems in their benchmark without explaining much of the connection and lineage, which is a big thing in research publication.
Their explanation for why their idea (SSD) might work - precision-exploration conflict hypothesis - is something adaptive decoding also tries to solve.
I've been wondering about adaptive decoding! It seems obvious to me that at some points during decoding (reasoning, "creative thinking") you would want a higher temperature, while at other points (emitting syntactically correct code, following a plan that was already established) you would want lower temperature.
uduni 1 days ago [-]
It's crazy how much better you can make LLM output just by asking "is this the most elegant solution?" in a loop.
(Not fine-tuning, but interesting nonetheless. If a model can so easily find a more elegant solution, why didn't it pick that in the first place?)
noman-land 22 hours ago [-]
The elegant solution rarely happens on the first try. Many times you need to first arrive at a solution, and then keep iterating on it until it's elegant. Akin to "sorry I didn't have time to write a shorter letter".
suzzer99 1 days ago [-]
IME human developers also span a spectrum on this. On one end, you have devs who might meditate half a day on different solutions before writing a line of code. On the other end are devs who run full speed ahead with the first working solution that comes to mind. LLMs in their current form are mostly the latter.
jditu 22 hours ago [-]
[dead]
khalic 1 days ago [-]
Incredible, will translate to better coding models in the near future.
We really need to develop better tools to understand what's happening inside these NNs. Working with high-D spaces is not something we're good at, and we're basically throwing stuff at it and seeing if it sticks.
0x3f 1 days ago [-]
Haven't read the paper yet, but it is interesting how seemingly simple many breakthroughs in ML are. Even transformers are like that. Maybe it's hindsight bias.
I suppose we just don't have a deeper underlying theory to lean on and help us 'design' anything.
christophilus 1 days ago [-]
A lot of discoveries are like that. In fact, simplicity is often the hallmark of correctness, and complexity is often a sign that our understanding is incomplete and we’re still stumbling towards the right model. Not always, but often. It’s been a good rule of thumb in my programming career.
heeton 1 days ago [-]
100%. I have a guiding approach when solving problems: keep reframing and exploring until the solution becomes obvious.
I often find, if I've got a complicated solution, it’s because I haven’t fully examined the problem.
Teever 1 days ago [-]
A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. -- Antoine de Saint-Exupery
GandalfHN 1 days ago [-]
[flagged]
GandalfHN 1 days ago [-]
[flagged]
GandalfHN 1 days ago [-]
[flagged]
ultramann 1 days ago [-]
Maybe not the thing I should be focusing on, but I was surprised this paper came from Apple. I was under the impression that Apple's AI/LLM research was far behind the curve. I get that research is a rising-tide-lifts-all-boats situation; I just thought that I had seen lots of negative news about Apple's progress on that front, and heuristically haven’t seen many (any?) Apple research papers make it to the front page of Hacker News. Wondering if anyone more familiar with Apple/AI research could comment on this?
bensyverson 1 days ago [-]
Apple routinely makes hn's front page for their AI research [0][1], particularly related to their work with small on-device models.
It’s so ironic that Apple still publishes AI research and OpenAI does not.
dhruv3006 1 days ago [-]
I find it ironic too - there was no need for OpenAI to not publish really.
michaelcampbell 1 days ago [-]
They have no marketplace to religiously defend for it...yet.
drdrek 23 hours ago [-]
This is the "Factors" Bonanza in finance all over again. You get a generally useful model, then you over-fit it to some criteria and announce advancement in the field, then it performs worse in real life. New infinite academic article glitch just dropped boys!
namuol 17 hours ago [-]
> sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning
It’s all moonspeak to me. I tried reading other comments that explain this and they all sounded different or contradictory. I’ve studied ML as a hobby years ago but this was before the LLM explosion. Guess I need to start over again?
l5870uoo9y 1 days ago [-]
> Our method, simple self-distillation (SSD), is embarrassingly simple: sample solutions from the base model with specified temperature and truncation, then fine-tune on those raw, unverified samples via standard cross-entropy loss.
So you prompt the base model for answer and then rerun the prompt with the answer from the first run?
ACCount37 1 days ago [-]
No. There's no "answer" really.
They use self-distillation to shift the output distribution of the model towards that of the same model, but running with different temperature/truncation settings in sampling.
This effectively "folds" the logit tail truncation behavior into the model itself.
Not entirely unlike a few "model controlled sampling settings" things I've seen in what it does, but different in execution.
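A toy sketch of that "folding" idea, under loud assumptions: a single position, a four-token vocabulary, made-up logits, and the exact truncated teacher distribution used as the training target (the paper trains on sampled sequences, which this does not reproduce). The names `top_k_truncate` and `distill` are hypothetical helpers, not anything from the paper. It only illustrates that minimizing cross-entropy against the truncated distribution pulls the plain, untruncated decode of the student toward it.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_truncate(probs, k):
    """Keep the k most likely tokens, zero the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def distill(student_logits, teacher_probs, lr=0.5, steps=2000):
    """Gradient descent on the cross-entropy between teacher and student."""
    logits = list(student_logits)
    for _ in range(steps):
        p = softmax(logits)
        # The gradient of cross-entropy w.r.t. logit i is (p_i - teacher_i).
        logits = [z - lr * (pi - ti) for z, pi, ti in zip(logits, p, teacher_probs)]
    return logits

# Made-up logits: two plausible tokens plus a low-probability distractor tail.
base_logits = [2.0, 1.5, -1.0, -1.5]

# "Teacher" = the same model decoded with temperature 1.5 and top-2 truncation.
teacher = top_k_truncate(softmax(base_logits, temperature=1.5), k=2)

# "Student" = the same logits after training on the teacher distribution,
# then decoded plainly (temperature 1, no truncation).
student = softmax(distill(base_logits, teacher))

base_tail = sum(softmax(base_logits)[2:])
student_tail = sum(student[2:])
```

In this toy, `student_tail` ends up far smaller than `base_tail` even though the student is decoded with no truncation at all, which is the sense in which the truncation has moved into the weights.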
zug_zug 1 days ago [-]
Yeah basically.
You use the outputs from the first run (right or wrong) as answers for the second training run, and repeat. Magically it works. That's what's so surprising.
I guess a theory is because there are so many diverse ways to be wrong that they don't accumulate error... still seems surprising and would be interesting to see if it works in other domains.
EdNutting 11 hours ago [-]
How is this not equivalent to training the model on the test data set? Yes it performs better at generating code for the target problems, but seemingly by becoming more tuned to the specific context of those problems (“context aware”), which suggests to me it would not generalise to real-world usage?
joshuaisaact 12 hours ago [-]
This was a really interesting paper but there's a massive gap in what they didn't try, which is inference-time temperature changes based on the fork/lock distinction.
Maybe I'll try that myself, because it feels like it could be a great source of improvements. It would be really useful to see adaptive per-token sampling as an additional decode-only baseline.
tatrions 1 hours ago [-]
Token-level entropy is a pretty clean proxy for detecting fork vs lock positions at inference time. Low entropy = lock (decode greedily), high entropy = fork (sample with more temperature). Speculative decoding already exploits something similar where the small draft model handles the predictable tokens and the big model kicks in at the uncertain ones. Combining that with this paper's fork/lock framing could get you adaptive temperature basically for free during inference.
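A minimal sketch of that entropy heuristic, assuming you have the raw next-token logits in hand. The 0.5-nat threshold, the two temperature values, and the function name `adaptive_temperature` are all hypothetical choices for illustration; nothing here comes from the paper, and a real system would need to tune them.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adaptive_temperature(logits, threshold=0.5, low=0.0, high=1.0):
    """Classify a position as lock or fork and pick a temperature.

    A temperature of 0.0 is used here to mean "decode greedily".
    The threshold is an assumed value, not a tuned one.
    """
    h = entropy(softmax(logits))
    if h < threshold:
        return "lock", low
    return "fork", high

# Made-up logits: one dominant token versus three near-equal candidates.
lock_logits = [8.0, 0.5, 0.0, -1.0]
fork_logits = [1.2, 1.0, 0.9, -3.0]

lock_kind, lock_temp = adaptive_temperature(lock_logits)
fork_kind, fork_temp = adaptive_temperature(fork_logits)
```

One caveat with any hard threshold: a lock position with a fat distractor tail can look like a fork by entropy alone, which is arguably the case the paper's framing is worried about.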
grumbelbart 10 hours ago [-]
Is this some kind of calibration then? I'd expect that the probabilities automatically adjust during training, such that in "lock" mode, for example, syntax-breaking tokens have a very low probability and would not be picked even with higher temperature.
vishnugupta 1 days ago [-]
Can someone please eli5 this to a friend web developer? I read the abstract but couldn’t understand much.
unknownx113 1 days ago [-]
you're probably overcomplicating it; as the paper says, it's embarrassingly simple: given a problem set, generate a response for each problem with a fixed temperature and truncation - then fine tune the model on the generations.
Their hypothesis as to why this works requires a bit more knowledge about model architecture, but basically when a model generates code some positions have only one right answer and some have many valid options - but the model has to use one global confidence setting for both. Sampling with a specific temperature + a garbage-token filter, then training on those outputs, teaches the model to internalize 'be precise where there's one answer, stay open-minded where there are several' — without anyone labeling which is which.
Note that there's a lot more nuance to this and I simplified a lot.
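For anyone who wants to see what "a specific temperature + a garbage-token filter" means mechanically, here is a rough sketch of the sampling half only, in plain Python. It assumes one common ordering (temperature, then top-k, then top-p on the renormalized survivors); real libraries differ in the details, and the function names and example values are illustrative, not taken from the paper.

```python
import math
import random

def truncated_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: prob} after temperature, top-k, then top-p."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely; top_k == 0 means "no top-k".
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Nucleus step: keep the smallest prefix whose renormalized mass >= top_p.
    k_mass = sum(probs[i] for i in order)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i] / k_mass
        if mass >= top_p:
            break

    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

def sample_token(logits, rng, **settings):
    """Draw one token index from the truncated distribution."""
    dist = truncated_distribution(logits, **settings)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

# Made-up logits: two strong candidates and a tail of unlikely tokens.
logits = [3.0, 2.5, 0.0, -2.0, -4.0]
dist = truncated_distribution(logits, temperature=1.5, top_k=3, top_p=0.9)
rng = random.Random(0)
draws = [sample_token(logits, rng, temperature=1.5, top_k=3, top_p=0.9)
         for _ in range(200)]
```

The raised temperature flattens the choice between the two strong candidates, while the truncation guarantees the tail tokens can never be drawn. The fine-tuning step then trains on whatever sequences come out of a sampler like this.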
zug_zug 1 days ago [-]
ELI 5
You teach the machine by asking it to solve some problems, and then whatever answer it gives you say "That's exactly right. Now we train on those answers YOU just gave me" (even if they are wrong) and repeat. Somehow THAT works over time.
useful 1 days ago [-]
if the probability mass is on a single token, it's a precise answer like `1 + 1 = `
if the predicted next token shares probability with other tokens, then there are multiple answers like `position: `
you can generate and train answers by exploring on varying the length of the code generated
itmitica 1 days ago [-]
It’s an interesting claim, and the reported benchmark gains are large, but it is still an April 1, 2026 arXiv preprint, so I’d treat it as promising rather than settled.
hooloovoo_zoo 22 hours ago [-]
One sentence summary: We fine-tuned a general-purpose model to produce valid benchmark code results and it got better at producing benchmark code results; we didn't bother to evaluate it on anything the model used to be good at.
andy_xor_andrew 22 hours ago [-]
Not really? If you read it, there is no validation, no correctness signal, no verification, none of that. They're just passing in benchmark inputs, collecting the outputs (regardless of their quality), training on those outputs, and then sweeping the decode settings (temp, topk) of the resulting model. Their conclusion is that this results in a better model than the original - even when taking into consideration the same temp/topk sweep of the original.
So no, they are not fine-tuning a general purpose model to produce "valid benchmark code results."
fpgaminer 20 hours ago [-]
Not only that, they additionally ran an experiment with the training temperature turned way up (2.0) and truncation turned off such that the majority of SFT examples were incoherent (63% IIRC). Yet the model finetuned on these broken examples still improved over baseline.
hooloovoo_zoo 22 hours ago [-]
They are training the model to
1. Produce code (as opposed to answer a question, write a poem, etc.)
2. Produce long enough output to be a valid solution.
So they are doing exactly what I said. Cheers.
mememememememo 22 hours ago [-]
In layman's terms, they are putting wet tyres on when it is raining and saying the car performs better over the next lap?
roger_ 1 days ago [-]
Skimmed this but don't have an intuitive understanding of why this works and how temperature and truncation factor in.
an0malous 1 days ago [-]
I’d like to understand AI research better and I recall some posts a while back where someone collected all the key papers that one should read, but I don’t remember enough to be able to find it. Does anyone know what I’m talking about and could link me to that post?
zug_zug 1 days ago [-]
This might sound paradoxical -- but any decent LLM will be happy to explain all the papers to you at great depth, and read new ones, and translate the math into simpler concepts and such. It'll also happily recommend relevant math to study, or give training problems, or whatever you want.
So... it's like a golfer who hits thousands of balls into an open field without ever once aiming for a hole. The relentless repetition flawlessly locks in their foundational muscle memory and basic swing mechanics, so when they finally step up to a real course, they don't have to waste a single thought on how to hold the club. Their basic swing is completely automatic - they can confidently take the creative, high-risk shot required to actually sink a hole-in-one.
thunky 8 hours ago [-]
> The relentless repetition flawlessly locks in their foundational muscle memory and basic swing mechanics
If only this were true there wouldn't be an army of duffers who after a lifetime of "training" still dig a trench in front of the ball every time they play.
mickdarling 1 days ago [-]
I'm working on a tool to determine which portions of an LLM process can be optimized, and how to measure that optimization and check whether it's optimizable at all. The shaping pattern that they talk about here is directly relevant and makes a whole lot more processes potentially optimizable by looking at the pattern rather than if the metrics just go up or down.
gavinray 1 days ago [-]
Why have we been fed the narrative that training models on their own output progressively degrades quality?
It's the first thing anyone would think of (like a self-hosted compiler) but everything I've read said "it doesn't work."
EDIT: For context:
> Shumailov et al. (2024) — "AI models collapse when trained on recursively generated data" (Nature, 2024)
dwa3592 1 days ago [-]
Can anyone help clarify these doubts - I didn't see any information about how different the test/benchmark set is from the training set. It feels like an important gap to leave unfilled in an ML paper. What if there is an overlap between the problems in the test set and the training set? What is the decontamination strategy of going from LCBv5 to LCBv6?
Lerc 22 hours ago [-]
This is the natural conclusion of what was really claimed about model collapse, and indeed natural evolution. Making an imperfect copy while invoking a selection mechanism is evolution.
Some of the claims about models training on their own data, in their enthusiasm to frame it as a failure, went further to suggest that it magnified biases. I had my doubts about their conclusions. If it were true, it would be a much greater breakthrough because the ability to magnify a property represents a way to measure a weak version of that property. The ability to do that would mean they would have found a way to provide a training signal to avoid bias. It would be great if that's what they did but I suspect there would have been more news about it.
Perhaps this paper will put to rest the notion that AI output is useless as training data. It has only ever been the case that it was useless as an indiscriminate source of data.
try-working 20 hours ago [-]
most codebases dont have traces to train on. if you use rlm-workflow you will build up rich traceability in the form of requirements, plans, implementation artifacts, along with worktree diffs. with these, you can then use self-distillation on models or use autoagent to improve your harness. https://github.com/doubleuuser/rlm-workflow
crustycoder 1 days ago [-]
"SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6"
I know virtually nothing about this area but my naive take is that something that means it still only passes tests around half the time doesn't seem like a particularly big jump forwards.
What am I missing?
SEMW 1 days ago [-]
There's no shortage of benchmarks (coding or otherwise) that any competent coding model will now pass with ~100%.
But no-one quotes those any more because if everyone passes them, they don't serve any useful purpose in discriminating between different models or identifying advancements
So people switch to new benchmarks which either have more difficult tasks or some other artificial constraints that make them in some way harder to pass, until the scores are low enough that they're actually discriminating between models. And a 50% score is in some sense ideal for that - there's lots of room for variance around 50%.
(whether the thing they're measuring is something that well correlates to real coding performance is another question)
So you can't infer anything in isolation from a given benchmark score being only 50% other than that benchmarks are calibrated to make such scores the likely outcome
crustycoder 24 hours ago [-]
So it's the relative and not the absolute diff that matters - thanks.
martinrolph 12 hours ago [-]
Think of it less like a test suite and more like an exam. If you're trying to differentiate between the performance of different people/systems/models, you need to calibrate the difficulty accordingly.
When designing a benchmark, a pass rate of roughly 50% is useful because it gives you the most information about the relative performance of different models. If the pass rate is 90%+ too often, that means the test is too easy: you're wasting questions asking the model to do things we already know it can do, and getting no extra information. And if it's too low then you're wasting questions at the other end, trying to make it do impossible tasks.
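A back-of-the-envelope sketch of why roughly 50% is the sweet spot, under a simplifying assumption that is mine rather than the thread's: treat each benchmark problem as an independent pass/fail coin flip with pass probability p. Both the variance and the entropy of that outcome peak at p = 0.5, so that is where a single problem tells you the most. Real benchmark design also depends on how much models actually differ on each item, which this ignores.

```python
import math

def outcome_variance(p):
    """Variance of a single pass/fail outcome with pass probability p."""
    return p * (1.0 - p)

def outcome_entropy_bits(p):
    """Information, in bits, carried by one pass/fail outcome."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

# Illustrative pass rates, from "nearly impossible" to "nearly always solved".
rates = [0.05, 0.25, 0.5, 0.75, 0.95]
most_informative = max(rates, key=outcome_entropy_bits)

for p in rates:
    print(f"p={p:.2f}  variance={outcome_variance(p):.4f}  "
          f"bits={outcome_entropy_bits(p):.3f}")
```

A problem that 95% of models pass carries well under a third of a bit in this picture, which is the "wasting questions" effect described above.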
xbmcuser 1 days ago [-]
So the chances of Singularity went up.
hu3 1 days ago [-]
Or down if this research leads to a local minimum.
1 days ago [-]
fooker 1 days ago [-]
I'm excited for the long tail of techniques like this that are going to be discovered over the next several decades and that are going to make this technology eventually run on a toaster!
drooby 1 days ago [-]
Fascinating...
This feels eerily similar to sleep consolidation or synaptic pruning
ACCount37 1 days ago [-]
I don't see much similarity? Unless you're looking at self-distillation in general and not just this use of it.
oliver236 1 days ago [-]
How not?
I think the analogy is actually pretty specific to this paper, not just self-distillation in general.
During sleep your brain replays experiences but noisy and distorted. The replays are often incoherent as narratives (dreams are weird). But the consolidation still works because the value isn't in the narrative coherence, it's in the activation patterns at each moment. Important pathways get strengthened, weak ones get pruned. Section 4.4 of this paper is what makes the connection click. They cranked training temperature to 2.0 with no truncation. 62% of the sampled outputs had no extractable code. Coherent Python that devolves into multilingual gibberish halfway through. The model still improved (+5.7pp pass@1).
This makes no sense if you think the model is learning from good code examples. But it makes a lot of sense if you think of it as the model replaying its own knowledge back to itself in a noisy/distorted form, and the replay process strengthening what matters (sharp distributions at "lock" positions where one token is correct, broad distributions at "fork" positions where multiple approaches work) while pruning what doesn't (distractor tails). The model doesn't learn anything new. It just wakes up performing better because what it already knew got cleaned up.
How is this comment not at number 1??
ACCount37 1 days ago [-]
This is a property of self-distillation.
Self-distillation shifts the behavior of the model towards that of the model + steering. As such, you don't strictly "need" the tokens to be in-domain for it to work. The logits are a vessel for transferring the steering into the model's internals.
The tokens can be gibberish. What transfers isn't whether they're gibberish or not, but how the flavor of model predictions, if given gibberish, differs from that of an unsteered version of itself.
In this specific case, the behavioral difference comes from the "temperature-shifted, truncated samples" in the "teacher" sampling strategy, and it is that difference that is internalized by the "student" model.
drooby 22 hours ago [-]
I think we’re agreeing. The point of the sleep parallel is exactly that the content doesn’t matter, and it’s the filtering process that does the work. Brains replay noisy, sometimes incoherent patterns during sleep and the value is in how that replay reshapes connection weights, not in whether the replay is accurate. That’s the same principle you’re describing with the steering signal
I.e sleep replays don’t need to replay Tuesday’s meeting accurately. They just need to activate the relevant pathways so that the strong ones fire and the weak ones don’t. The pattern of what fires versus what doesn’t is the signal. The “content” of the dream is basically irrelevant.
augment_me 1 days ago [-]
Isn't this what DeepSeek + Kimi did to Claude?
smallerize 1 days ago [-]
I don't suppose they published the improved models?
4b11b4 1 days ago [-]
Self-consistency meets fine-tuning?
hackermeows 21 hours ago [-]
what is the big deal with obsidian ? I see a lot of people use it but I'm more than happy with giving an LLM a local sqlite table , embedding api and asking the agent to maintain its own memory
antirez 1 days ago [-]
Another potentially usable trick is the following: based on the observation that longer token budget improves model performances, one could generate solutions using a lot of thinking budget, then ask the LLM to turn the trace into a more compact one, and later SFT on that. That said, I have the feeling the result of the paper will likely be hard to apply in practice without affecting other capabilities, and/or not superior to other techniques that provide similar improvement in sampling.
robwwilliams 1 days ago [-]
Very cool. An evolutionary biologist would say: Welcome to the party!
Mutation rate modulation is the AI engineers’ heat. And selection does the trimming of the outliers.
Some more serious biomorphic thinking and we may get to the next big insight courtesy of 3+ billion years of evolution: evolution that enabled a great ape species to write a paper like this and build LMM’s like Gemma4 that totally rock on a 3.5 pound MacBookPro M5 Max with 128 GB of RAM.
hnretards 23 hours ago [-]
I've been doing something even better than this for years using only Mistral 7b.
My locally running Mistral 7b is 100x better at modern JavaScript than any model on the market, mainly just from RAG on my own code samples.
That's basically what they are describing with "post-training", the TLDR is that code especially of a certain style is vastly simpler than written language.
You really don't need a huge model or data centers etc. you just need a small but good model like Mistral 7b and literally a few good samples.
But you guys keep doing you lol. A bunch of non-devs trying to solve code is pretty funny to watch.
naasking 19 hours ago [-]
It's interesting that LLMs improve skills, especially on harder problems, just by practicing them. That's effectively what's going on.
porridgeraisin 1 days ago [-]
There's an obvious baseline which seems missing
If you sample from the base model with T=1.6, top_k=20, top_p=0.8, i.e, the decode settings used for the distillation's ground truth, does it match the SSD'd model + some decoding? Performance wise.
Their sweep is missing this. And only covers "standard" decoding settings.
LeonTing1010 11 hours ago [-]
[dead]
Sim-In-Silico 19 hours ago [-]
[dead]
techpulselab 1 days ago [-]
[dead]
neuzhou 20 hours ago [-]
[dead]
aplomb1026 21 hours ago [-]
[dead]
VoqalAI 1 days ago [-]
[dead]
maxbeech 23 hours ago [-]
[dead]
aplomb1026 1 days ago [-]
[dead]
pithtkn 1 days ago [-]
[dead]
usermac 1 days ago [-]
[dead]
aiiaro 1 days ago [-]
[flagged]
yubainu 1 days ago [-]
[dead]
dist-epoch 1 days ago [-]
[flagged]
avaer 1 days ago [-]
I definitely pay more attention to papers affiliated with Chinese companies; the economics seem to be more conducive to doing good academic work and publishing it. I would say the same for companies like Apple (where TFA came from).
But to filter based on author's names sounds pretty darn racist.
ptidhomme 1 days ago [-]
I used to have the opposite rule in my signal processing field : the more Chinese names, the less innovation was there.
They seemed like they had to be churning out papers and any little adaptation to existing research triggered a new publication.
But it may have changed now.
0x3f 1 days ago [-]
That's... almost every AI paper.
1 days ago [-]
amelius 1 days ago [-]
So
"Made in China, designed by Apple in California"
should be:
"Made in China, designed by Chinese people in California"?
jofzar 1 days ago [-]
> simple self-distillation (SSD):
Sorry apple, SSD is already taken, you can't use that acronym.
love2read 1 days ago [-]
You're right, I offer these alternatives:
Consistency Preservation Update (CPU)
Guided Probability Update (GPU)
History-aware Distillation Driving (HDD)
Probability Smoothing Update (PSU)
drittich 1 days ago [-]
I used to invent TLAs on the spot for fun, and when someone asked what it was, would respond, "It's a PUA", eventually revealing that meant "previously unknown acronym". It was even more annoying than it sounds.
ape4 1 days ago [-]
ATT=All TLAs are Taken
politelemon 1 days ago [-]
It's cringe worthy to see that the original paper itself is editorialised.
Title should be: Simple Self-Distillation Improves Code Generation
StevenWaterman 1 days ago [-]
"Embarrassingly" has a history as a technically meaningful word roughly equivalent to "maximally", see "Embarrassingly parallel"
Objective one should be to communicate effectively, not confuse everybody.
unknownx113 1 days ago [-]
that disqualifies like 80% of papers lmao
mikkupikku 1 days ago [-]
Lol, you're probably not wrong. But have you ever noticed that the most important papers tend to be on the clear and readable side of things? It's as if researchers understand that being understood is important, but deemphasize that when the paper itself isn't important in the first place. (Maybe if they're only publishing to not perish, not being understood is actually a good thing from their perspective?)
I dare say that in some ways, we understand LLMs better than humans, or at least the interpretability tools are now superior. Awkward place to be, but an interesting one.
Are you surprised we understand them better than brains?
That's a bit of an overstatement.
The entire field of ML is aimed at problems where deterministic code would work just fine, but the amount of cases it would need to cover is too large to be practical (note, this has nothing to do with the impossibility of its design) AND there's a sufficient corpus of data that allows plausible enough models to be trained. So we accept the occasionally questionable precision of ML models over the huge time and money costs of engineering these kinds of systems the traditional way. LLMs are no different.
What you are saying is fantasy nonsense.
> but the amount of cases it would need to cover is too large to be practical (note, this has nothing to do with the impossibility of its design)
So it doesn't work.
You would be sorely mistaken to think I'm utterly uninformed about LLM-research, even if I would never dare to claim to be a domain expert.
LLMs draw their origins from both n-gram language models (ca. 1990s) and neural networks and deep learning (ca. 2000). So we've only had really good ones maybe 6-8 years or so, but the roots of the study go back 30 years at least.
Psychiatry, psychology, and neurology on the other hand, are really only roughly 150 years old. Before that, there wasn't enough information about the human body to be able to study it, let alone the resources or biochemical knowledge necessary to be able to understand it or do much of anything with it.
So, sure, we've studied it longer. But only 5 times longer. And, I mean, we've studied language, geometry, and reasoning for literally thousands of years. Markov chains are like 120 years old, so older than computer science, and you need those to make an LLM.
And if you think we went down some dead-end directions with language models in the last 30 years, boy, have I got some bad news for you about how badly we botched psychiatry, psychology, and neurology!
Very, monsieur Laplace.
We have tons of low-hanging fruits across all fields of science and engineering to be picked, in form of different ways to apply and chain the models we have, different ways to interact with them, etc. - enough to fuel a good decade of continued progress in everything.
> What is (not) here to stay are the techbros who think every problem can be solved with LLMs.
LLMs are in all likelihood here to stay, but the scumbags doing business around them right now are hopefully going away eventually.
LLMs are "little people on a chip", a new kind of component, capable of general problem-solving. They can be tuned and trimmed to specialize in specific classes of problems, at great reduction of size and compute requirements. The big models will be around as part of user interface, but small models are going to be increasingly showing up everywhere in computational paths, as we test out and try new use cases. There's so many low-hanging fruits to pick, we're still going to be seeing massive transformations in our computing experience, even if new model R&D stalled today.
Much as Diogenes mocked Plato's definition of a man with a plucked chicken, LLMs revealed what "real" AI would require: continuous learning. That isn't to diminish the power of LLMs (they are useful), but that limitation is a fairly hard one to overcome if true AGI is your goal.
From what I understand, a living neural network learns several orders of magnitude more efficiently than an artificial one.
I'm not sure where that difference comes from. But my brain probably isn't doing back propagation, it's probably doing something very different.
(eg different kinds of learning for long-term memory, short-term memory, languages, faces and reflexes.)
The intersection of what with physics?
Sir Roger Penrose, on quantum consciousness (and there is some regret on his part here) -- OR -- Jacob Barandes for a much more current thinking on this sort of intersectional exploratory thinking.
> The earliest reference to the brain occurs in the Edwin Smith Surgical Papyrus, written in the 17th century BC.
I was actually thinking of ancient greeks when writing my comment, but I suppose Egyptians have even older records than them.
From https://en.wikipedia.org/wiki/History_of_neuroscience
I think that with grammar-aware sampling / constrained decoding [0][1] it is possible to sometimes skip calling the model altogether if only one token is allowed by grammar and just insert it, but I don't think that any of the current, widely used combinations of models/harnesses use it. And it only skips inference in rare edge cases.
I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.
[0] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
[1] https://developers.redhat.com/articles/2025/06/03/structured...
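A minimal sketch of the "skip the model when the grammar forces the token" idea, in Python. The toy grammar table and the `pick` callback standing in for the model are made up for illustration; this is not how llama.cpp or any particular harness implements it.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "grammar": maps the previous token to the tokens allowed next.
TOY_GRAMMAR: Dict[str, List[str]] = {
    "<start>": ["{"],
    "{": ['"name"'],
    '"name"': [":"],
    ":": ['"alice"', '"bob"'],
    '"alice"': ["}"],
    '"bob"': ["}"],
    "}": [],
}


def constrained_decode(pick: Callable[[List[str]], str]) -> Tuple[List[str], int]:
    """Generate tokens, calling `pick` (the stand-in model) only at real choices."""
    out: List[str] = []
    prev = "<start>"
    model_calls = 0
    while TOY_GRAMMAR[prev]:
        allowed = TOY_GRAMMAR[prev]
        if len(allowed) == 1:
            tok = allowed[0]      # forced by the grammar: no model call needed
        else:
            tok = pick(allowed)   # genuine choice: ask the model
            model_calls += 1
        out.append(tok)
        prev = tok
    return out, model_calls


# Stand-in "model" that always takes the first allowed token.
tokens, model_calls = constrained_decode(lambda allowed: allowed[0])
print("".join(tokens), "model calls:", model_calls)
```

In this toy case only one of the five tokens needs a model call, but that ratio depends entirely on how rigid the grammar is; for free-form code the forced positions are much rarer.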
Making coding agents spit out syntactically correct code token by token is like asking a human to code on a whiteboard.
We kinda have a little bit of it with some coding harnesses giving model access to LSP, but I think that we can insert this knowledge on a lower level if we find a clever way to somehow utilize it during sampling.
I think that there is a lot of low hanging fruit in this area.
And in general, I think that people try to use LLMs too much to solve problems that can be easily solved by cheaper (computationally), and, more importantly deterministic tools.
For example, back in the day when LLM-assisted coding just became a thing people very often complained about models generating syntactically incorrect code and inventing non-existent library methods.
Well, I, an experienced human programmer, probably would also be making syntax mistakes and inventing non-existent methods if you stripped me of my tools and made me write code in a bare text editor without syntax highlighting.
Thankfully, my IDE would autocomplete real syntax and actually existing library methods for me and immediately give me feedback if I make a mistake anyway. And all of it is achieved using reliable deterministic code without the inherent issues of statistical models.
I think that it is really inefficient to reach for an expensive and unreliable tool when a cheap and reliable tool will do.
1. code
2. syntax check / build / format / lint (details language dependent)
3. test
and they can hop between 1 and 2 however many times they want.
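The loop above can be sketched roughly like this for Python, using `ast.parse` as the cheap deterministic check in step 2. The `drafts` list is a hypothetical stand-in for successive agent outputs, and a real harness would also run a build, formatter, or linter.

```python
import ast


def syntax_check(source):
    """Step 2: return None if `source` parses as Python, else a short error message."""
    try:
        ast.parse(source)
        return None
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"


def code_until_valid(drafts):
    """Hop between steps 1 and 2 until a draft passes the syntax check."""
    for attempt, draft in enumerate(drafts, start=1):
        if syntax_check(draft) is None:
            return attempt, draft
    return None


# Hypothetical successive drafts: the first is missing its colon.
drafts = [
    "def add(a, b) return a + b",
    "def add(a, b):\n    return a + b\n",
]
result = code_until_valid(drafts)

# Step 3: test the draft that survived the syntax check.
namespace = {}
exec(result[1], namespace)
assert namespace["add"](2, 3) == 5
```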
I do think there is some merit in a tool that dumps all namespaces and reachable symbols so the agent can do its own autocomplete without a round-trip.
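A rough sketch of such a symbol dump for Python, built on the standard `inspect` module. The `dump_symbols` helper is a hypothetical name, and a real tool would need to cover nested modules, methods, and languages without runtime introspection.

```python
import importlib
import inspect


def dump_symbols(module_name):
    """List the public callables of a module with their signatures, where available."""
    module = importlib.import_module(module_name)
    lines = []
    for name, obj in inspect.getmembers(module):
        if name.startswith("_") or not callable(obj):
            continue
        try:
            sig = str(inspect.signature(obj))
        except (TypeError, ValueError):
            sig = "(...)"  # some builtins do not expose a signature
        lines.append(f"{module_name}.{name}{sig}")
    return lines


symbols = dump_symbols("json")
print("\n".join(symbols))
```

The output could be pasted into the agent's context once per session, so it can look up real names without a tool call for each one.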
As a human coder you don’t summon intellisense. It’s just popped up into your visual field as extra input - contextual cues.
You could force intellisense state into the context vector the LLM receives.
i once asked an LLM if it could ingest code from an interactive session more easily if it were in appropriately-typed markdown fences and it said absolutely yes, and that the syntax highlighting fed to it that way helps it immensely. i was downright shocked that syntax highlighting was anything more than noise for them.
I think speculative decoding counts as a (perhaps crude) way of implementing this?
There's a lot of work going on in various streams towards making it possible to vary compute per-token, dynamically, e.g. universal transformers. Maybe one day it'll work well enough to beat conventional techniques.
I got unstuck by randomizing the field order for each row?!? At training, and now I'm thinking I should do the same at inference time...
> This is probably due to the way larger numbers are tokenised, as big numbers can be split up into arbitrary forms. Take the integer 123456789. A BPE tokenizer (e.g., GPT-style) might split it like: ‘123’ ‘456’ ‘789’ or: ‘12’ ‘345’ ‘67’ ‘89’
One of the craziest LLM hacks that doesn't get love is https://polymathic-ai.org/blog/xval/
xVal basically says "tokenizing numbers is hard: what if instead of outputting tokens that combine to represent numbers, we just output the numbers themselves, right there in the output embedding?"
It works! Imagine you're discussing math with someone. Instead of saying "x is twenty five, which is large" in words, you'd say "x is", then switch to making a whistling noise in which the pitch of your whistle, in its position within your output frequency range, communicated the concept of 25.00 +/- epsilon. Then you'd resume speech and say "which is large".
I think the sentiment is that today's models are big and well-trained enough that receiving and delivering quantities as tokens representing numbers doesn't hurt capabilities much, but I'm still fascinated by xVal's much more elegant approach.
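A rough sketch of the input side of that idea in Python, assuming (as I understand the xVal write-up) that every number becomes a single `[NUM]` token whose embedding gets scaled by the numeric value. The tokenizer regex and the `xval_style_encode` name are made up for illustration, and the real method also normalizes values and adds a separate number head on the output side, neither of which is shown here.

```python
import re

# Numbers, words, or single punctuation marks (a toy tokenizer, not a real BPE).
TOKEN_RE = re.compile(r"-?\d+(?:\.\d+)?|[A-Za-z_]+|[^\w\s]")


def xval_style_encode(text):
    """Return (tokens, scales): numbers collapse to [NUM], their value goes in `scales`."""
    tokens, scales = [], []
    for piece in TOKEN_RE.findall(text):
        if piece[-1].isdigit():
            tokens.append("[NUM]")
            scales.append(float(piece))  # multiplier for the [NUM] embedding
        else:
            tokens.append(piece)
            scales.append(1.0)           # ordinary tokens are left unscaled
    return tokens, scales


tokens, scales = xval_style_encode("x is 25, which is large")
print(tokens)
print(scales)
```

The point is that 25 and 123456789 both cost exactly one token, so the arbitrary digit splits from the quote above never happen.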
I’d be very cautious of the phrase 'just like us'. Not only can anthropomorphism be misleading and make us see things where none exist, it can also befuddle us, especially when we don’t know much about ourselves.
Completely artistic creation, creating something that does not exist and that cannot produce things out of itself, means that locking can be more diffuse, not as settled.
>I love that we're still learning the emergent properties of LLMs!
There are tons of low-hanging fruits there.
In Nemotron the high-perplexity solutions are selected for RL, in VLM training a few people are looking at the entropy distributions of the training set, etc.
I think you are implying a reverse causation. They used a metaphor from us.
"Simple Self-Distillation". We already had an acronym for Solid-State Drive. Don't know about that technique, but the naming sure sounds... simple?
That already looks like Sonnet 3.x and 4 level capabilities to me, where the model in question (Gemma 4) sets up a whole Python project with a UI and installs Python libraries using uv etc.
Add this Simple Self-Distillation to the picture and by 2028 I see cheaper coding model providers with much more generous usage limits, and power users would be mostly running their own models anyway.
Anyone using these models as "non-deterministic transpilers" from natural language to code (experienced engineers who can write code themselves) would probably not be paying any AI providers.
[0] https://www.youtube.com/watch?v=-_hC-C_Drcw
Right now it feels like hammering a house onto a nail instead of the other way around.
LLMs have something that's not entirely unlike the "g factor" in humans - a broad "capability base" that spans domains. The best of the best "coding LLMs" need both good "in-domain training" for coding specifically and a high "capability base". And a lot of where that "base" comes from is: model size and the scale of data and compute used in pre-training.
Reducing the model scale and pruning the training data would result in a model with a lower "base". It would also hurt in-domain performance - because capabilities generalize and transfer, and pruning C code from the training data would "unteach" the model things that also apply to code in PHP.
Thus, the pursuit of "narrow specialist LLMs" is misguided, as a rule.
Unless you have a well-defined, fixed bar that, once cleared, makes the task solved, and there is no risk of scope adjustment, no benefit from any future capability improvements above that bar, and enough load to justify the engineering costs of training a purpose-specific model, a "strong generalist" LLM is typically a better bet than a "narrow specialist".
In practice, this is an incredibly rare set of conditions to be met.
There are hardware-based limitations on the size of LLMs you can feasibly train and serve, which impose a limit on the amount of information you can pack into a single model's weights, and on the amount of compute per second you can get out of that model at inference time.
My company has been working on this specifically because even now most researchers don't seem to really understand that this is just as much an economics and knowledge problem (cf Hayek) as it is "intelligence"
It is much more efficient to strategically delegate specialized tasks, or ones that require a lot of tokens but not a lot of intelligence, to models that can be served more cheaply. This is one of the things that Claude Code does very well. It's also the basis for MoE and some similar architectures with a smarter router model serving as a common base between the experts.
...with a fair amount of supervision, while frontier models would be running circles around them using project-specific memory and on-demand training (or whatever we would have by then).
If you're building something groundbreaking and new, the advantage will be slim to none.
Self-distillation was shown to be very efficient and effective back in January this year by an MIT and ETH team in their Self-Distillation Fine-Tuning (SDFT) LLM system [1],[2].
That work is also this paper's closest competitor, listed as On-Policy Self-Distillation in the comparison table.
I hope they keep the original work's real name, Self-Distillation Fine-Tuning (SDFT). Imagine a later paper citing this very paper as "cross-entropy self-distillation" instead of its own given name, Simple Self-Distillation (SSD). Although I'll admit SSD is a lousy name that collides with the common abbreviation for solid-state drive, as others have rightly pointed out.
I think they should have given proper credit to this earlier seminal work on SDFT, but apparently they just list it as one of the systems in their benchmark without explaining much of the connection and lineage, which is a big thing in research publications.
[1] Self-Distillation Enables Continual Learning:
https://arxiv.org/abs/2601.19897
[2] Self-Distillation Enables Continual Learning:
https://self-distillation.github.io/SDFT.html
https://ai.meta.com/research/publications/adaptive-decoding-...
(Not fine tuning, but interesting none the less. If a model can so easily find a more elegant solution, why didn't it pick that in the first place?)
We really need to develop better tools to understand what's happening inside these NNs. Working with high-D spaces is not something we're good at, and we're basically throwing stuff at it and seeing if it sticks.
I suppose we just don't have a deeper underlying theory to lean on and help us 'design' anything.
I often find, if I've got a complicated solution, it’s because I haven’t fully examined the problem.
[0] https://news.ycombinator.com/item?id=46117802
[1] https://news.ycombinator.com/item?id=47107974
It’s all moonspeak to me. I tried reading other comments that explain this and they all sounded different or contradictory. I’ve studied ML as a hobby years ago but this was before the LLM explosion. Guess I need to start over again?
So you prompt the base model for answer and then rerun the prompt with the answer from the first run?
They use self-distillation to shift the output distribution of the model towards that of the same model, but running with different temperature/truncation settings in sampling.
This effectively "folds" the logit tail truncation behavior into the model itself.
Not entirely unlike a few "model controlled sampling settings" things I've seen in what it does, but different in execution.
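To make the "temperature/truncation settings" part concrete, here is a small pure-Python sketch of how a logit vector turns into the truncated distribution that the samples are drawn from. The logits and the settings (T=1.6, top_k=4, top_p=0.8) are toy values picked for illustration, and the training step that moves the model towards these samples is not shown.

```python
import math


def truncated_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Temperature softmax, then top-k, then top-p (nucleus) truncation, renormalized."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices sorted from most to least likely.
    kept = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k is not None:
        kept = kept[:top_k]
    if top_p is not None:
        mass_kept = sum(probs[i] for i in kept)
        nucleus, cumulative = [], 0.0
        for i in kept:
            nucleus.append(i)
            cumulative += probs[i] / mass_kept
            if cumulative >= top_p:
                break
        kept = nucleus

    final_mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / final_mass  # everything outside `kept` stays at zero
    return out


toy_logits = [2.0, 1.5, 1.0, 0.0, -1.0, -3.0]
teacher = truncated_distribution(toy_logits, temperature=1.6, top_k=4, top_p=0.8)
print([round(p, 3) for p in teacher])
```

The low-probability tail ends up with exactly zero mass, and that is the behavior that gets "folded" into the weights when the model is trained on samples drawn this way.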
You use the outputs from the first run (right or wrong) as answers for the second training run, and repeat. Magically it works. That's what's so surprising.
I guess a theory is because there are so many diverse ways to be wrong that they don't accumulate error... still seems surprising and would be interesting to see if it works in other domains.
Maybe I'll try that myself, because it feels like it could be a great source of improvements. It would be really useful to see adaptive per-token sampling as an additional decode-only baseline.
Their hypothesis as to why this works requires a bit more knowledge about model architecture, but basically when a model generates code some positions have only one right answer and some have many valid options - but the model has to use one global confidence setting for both. Sampling with a specific temperature + a garbage-token filter, then training on those outputs, teaches the model to internalize 'be precise where there's one answer, stay open-minded where there are several' — without anyone labeling which is which.
Note that there's a lot more nuance to this and I simplified a lot.
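A toy illustration of why one global setting is a compromise, in Python. The "lock" and "fork" logits below are invented for the example (they are not from the paper): one position with a single right token, one with three plausible tokens, both with the same long distractor tail.

```python
import math


def softmax(logits, temperature):
    """Plain softmax over `logits` after dividing by `temperature`."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, chosen only to illustrate the two kinds of position.
lock_logits = [8.0] + [0.0] * 50            # one correct token + 50 distractors
fork_logits = [3.0, 2.8, 2.6] + [0.0] * 50  # three plausible tokens + 50 distractors


def tail_mass(logits, n_good, temperature):
    """Probability that sampling lands on a distractor (anything past the first n_good)."""
    return sum(softmax(logits, temperature)[n_good:])


def top_share(logits, n_good, temperature):
    """Share of the 'good' mass taken by the single most likely good token."""
    probs = softmax(logits, temperature)[:n_good]
    return probs[0] / sum(probs)


for t in (0.5, 1.0, 1.5):
    print(
        f"T={t}: lock tail={tail_mass(lock_logits, 1, t):.4f}, "
        f"fork tail={tail_mass(fork_logits, 3, t):.4f}, "
        f"fork top share={top_share(fork_logits, 3, t):.3f}"
    )
```

Raising the temperature spreads the fork more evenly across its three options, which is what you want there, but it also pushes more mass into the distractor tail at the lock. Lowering it does the opposite, so no single value is right for both.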
You teach the machine by asking it to solve some problems, and then whatever answer it gives you say "That's exactly right. Now we train on those answers YOU just gave me" (even if they are wrong) and repeat. Somehow THAT works over time.
You can generate answers to train on by exploring variations in the length of the generated code.
So no, they are not fine-tuning a general purpose model to produce "valid benchmark code results."
If only this were true there wouldn't be an army of duffers who after a lifetime of "training" still dig a trench in front of the ball every time they play.
It's the first thing anyone would think of (like a self-hosted compiler) but everything I've read said "it doesn't work."
EDIT: For context:
Some of the claims about models training on their own data, in their enthusiasm to frame it as a failure, went further to suggest that it magnified biases. I had my doubts about their conclusions. If it were true, it would be a much greater breakthrough because the ability to magnify a property represents a way to measure a weak version of that property. The ability to do that would mean they would have found a way to provide a training signal to avoid bias. It would be great if that's what they did but I suspect there would have been more news about it.
Perhaps this paper will put to rest the notion that AI output is useless as training data. It has only ever been the case that it was useless as an indiscriminate source of data.
I know virtually nothing about this area but my naive take is that something that means it still only passes tests around half the time doesn't seem like a particularly big jump forwards.
What am I missing?
But no-one quotes those any more because if everyone passes them, they don't serve any useful purpose in discriminating between different models or identifying advancements
So people switch to new benchmarks which either have more difficult tasks or some other artificial constraints that make them in some way harder to pass, until the scores are low enough that they're actually discriminating between models. And a 50% score is in some sense ideal for that - there's lots of room for variance around 50%.
(whether the thing they're measuring is something that well correlates to real coding performance is another question)
So you can't infer anything in isolation from a given benchmark score being only 50% other than that benchmarks are calibrated to make such scores the likely outcome
When designing a benchmark, a pass rate of roughly 50% is useful because it gives you the most information about the relative performance of different models. If the pass rate is 90%+ too often, that means the test is too easy: you're wasting questions asking the model to do things we already know it can do, and getting no extra information. And if it's too low then you're wasting questions at the other end, trying to make it do impossible tasks.
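One way to put a number on that, as a simplification: treat each question as a single pass/fail outcome and compute how many bits it carries at a given pass rate. This is just the binary entropy function; it ignores correlations between questions and between models.

```python
import math


def binary_entropy(p):
    """Bits of information in one pass/fail outcome with pass probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


rates = [0.05, 0.25, 0.5, 0.75, 0.9, 0.99]
table = {p: round(binary_entropy(p), 3) for p in rates}
for p, bits in table.items():
    print(f"pass rate {p:>4}: {bits} bits per question")
```

The curve peaks at a 50% pass rate and falls off towards both ends, so a question nearly every model passes (or fails) tells you almost nothing.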
This feels eerily similar to sleep consolidation or synaptic pruning
I think the analogy is actually pretty specific to this paper, not just self-distillation in general.
During sleep your brain replays experiences but noisy and distorted. The replays are often incoherent as narratives (dreams are weird). But the consolidation still works because the value isn't in the narrative coherence, it's in the activation patterns at each moment. Important pathways get strengthened, weak ones get pruned. Section 4.4 of this paper is what makes the connection click. They cranked training temperature to 2.0 with no truncation. 62% of the sampled outputs had no extractable code. Coherent Python that devolves into multilingual gibberish halfway through. The model still improved (+5.7pp pass@1).
This makes no sense if you think the model is learning from good code examples. But it makes a lot of sense if you think of it as the model replaying its own knowledge back to itself in a noisy/distorted form, and the replay process strengthening what matters (sharp distributions at "lock" positions where one token is correct, broad distributions at "fork" positions where multiple approaches work) while pruning what doesn't (distractor tails). The model doesn't learn anything new. It just wakes up performing better because what it already knew got cleaned up.
How is this comment not at number 1??
Self-distillation shifts the behavior of the model towards that of the model + steering. As such, you don't strictly "need" the tokens to be in-domain for it to work. The logits are a vessel for transferring the steering into the model's internals.
The tokens can be gibberish. What transfers isn't whether they're gibberish or not, but how the flavor of model predictions, if given gibberish, differs from that of an unsteered version of itself.
In this specific case, the behavioral difference comes from the "temperature-shifted, truncated samples" in the "teacher" sampling strategy, and it is that difference that is internalized by the "student" model.
I.e., sleep replays don’t need to replay Tuesday’s meeting accurately. They just need to activate the relevant pathways so that the strong ones fire and the weak ones don’t. The pattern of what fires versus what doesn’t is the signal. The “content” of the dream is basically irrelevant.
Mutation rate modulation is the AI engineers’ heat. And selection does the trimming of the outliers.
Some more serious biomorphic thinking and we may get to the next big insight courtesy of 3+ billion years of evolution: evolution that enabled a great ape species to write a paper like this and build LLMs like Gemma4 that totally rock on a 3.5 pound MacBookPro M5 Max with 128 GB of RAM.
My locally running Mistral 7b is 100x better at modern JavaScript than any model on the market, mainly just from RAG on my own code samples.
That's basically what they are describing with "post-training"; the TLDR is that code, especially of a certain style, is vastly simpler than written language.
You really don't need a huge model or data centers etc.; you just need a small but good model like Mistral 7b and literally a few good samples.
But you guys keep doing you lol. A bunch of non-devs trying to solve code is pretty funny to watch.
If you sample from the base model with T=1.6, top_k=20, top_p=0.8, i.e., the decode settings used for the distillation's ground truth, does it match the SSD'd model + some decoding? Performance-wise.
Their sweep is missing this. And only covers "standard" decoding settings.
But to filter based on authors' names sounds pretty darn racist.
They seemed like they had to be churning out papers and any little adaptation to existing research triggered a new publication.
But it may have changed now.
"Made in China, designed by Apple in California"
should be:
"Made in China, designed by Chinese people in California"?
Sorry apple, SSD is already taken, you can't use that acronym.
Consistency Preservation Update (CPU)
Guided Probability Update (GPU)
History-aware Distillation Driving (HDD)
Probability Smoothing Update (PSU)
Title should be: Simple Self-Distillation Improves Code Generation
https://en.wikipedia.org/wiki/Embarrassingly_parallel
Many computer science paper titles allude to past titles in other CS papers.
Calling it “cringe worthy” is unnecessarily mean. There is context and history you don’t understand.
There are two distinct billions. https://en.wikipedia.org/wiki/Billion