This is very interesting:
"Empirical Validation. While we cannot verify these theoretically, we evaluate each empirically. We use the Qwen-2.5-7B-Instruct model (Hui et al., 2024) as the base policy and the ToolAlpaca dataset (Tang et al., 2023). In this benchmark, the model receives a tool-API specification and a user request, and must identify the correct tool call. Without demonstrations, the base model solves only 42% of examples. When provided with the appropriate demonstration c for each prompt x, the teacher achieves a 100% success rate. To further test reward proximity, we manually inspected 50 teacher reasoning traces. In all cases, not only were the final tool calls correct, but the intermediate chain-of-thought was valid and semantically grounded. This suggests that the teacher is reconstructing a correct reasoning process rather than merely copying the expert output. These observations provide evidence for the first requirement, that the demonstration-conditioned model behaves as an optimal policy."
HarHarVeryFunny 14 hours ago [-]
The title seems a bit misleading.
The paper is about a way to do SFT with less chance of catastrophic forgetting and performance regressions.
The idea is that SFT on new data that was NOT generated by the model (aka "off policy" data) is likely to cause problems due to the statistical mismatch between the new data and what the model has already learnt. As I understand it, their solution is to statistically align the new data with the old by feeding it to the old model, which will hopefully grok it via in-context learning, then have it regenerate it in its own words such that "off policy" data now becomes "on policy". The model can then be SFT trained on this regenerated data (i.e. self-distillation).
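A minimal sketch of the regenerate-then-train data flow as this comment reads it, not necessarily the paper's exact algorithm. The `generate` callable, the prompt template, and `fake_generate` are all hypothetical stand-ins so the example runs without a model.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: generate(prompt) returns a completion sampled
# from the current model.
Generate = Callable[[str], str]


def build_self_distillation_set(
    generate: Generate,
    examples: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Turn off-policy (prompt, expert_answer) pairs into on-policy pairs."""
    regenerated = []
    for prompt, expert_answer in examples:
        # Teacher pass: the same model sees the expert answer in context.
        teacher_prompt = (
            f"Example solution:\n{expert_answer}\n\n"
            f"Now answer in your own words:\n{prompt}"
        )
        own_words = generate(teacher_prompt)
        # Training target: the bare prompt paired with the model's own phrasing.
        regenerated.append((prompt, own_words))
    return regenerated


def fake_generate(prompt: str) -> str:
    # Stand-in for an LLM call: echoes the last line of the prompt.
    return "MODEL: " + prompt.splitlines()[-1]


pairs = build_self_distillation_set(fake_generate, [("2+2?", "4")])
```

The point of the sketch is only that the training targets come from the model's own sampling process, with the expert data used as context rather than as the target.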
To me SFT and "continual learning" are two distinct things.
Human/animal continual learning is always-on learning that removes the need for, and distinction between, training and inference, and it is initiated by prediction failure. It's as much about skill acquisition as it is about knowledge acquisition. Continual learning can happen in any context, from trying to do something (or just observing something passively) and being wrong about the outcome of your own actions, or what some external entity does next, to curiosity/boredom driven exploration and play, which is more along the spectrum of pure learning with less expectation of outcomes.
Continual learning is what, one day, will let the AGI intern pick up new skills on the job by trying to do things and failing/learning/practicing until they get better. This is not the same as sending the intern home with a textbook to read, or a transcript of the conversations you had with it today, and having it take these onboard overnight, which is basically what SFT is designed to do - intermittent addition of new declarative data.
airstrike 1 days ago [-]
Both title and abstract feel a little too confident, which ironically makes me more skeptical rather than less.
I find the choice of the words "enable" in the title and "establishing" at the end of the abstract to be particularly jarring.
teleforce 23 hours ago [-]
Fun fact: this paper is cited by the Simple Self-Distillation (SSD) paper from Apple [1],[2]. I think it is a bad naming scheme, given how common the SSD acronym already is and the fact that the method belongs to on-policy self-distillation [3]. But according to the authors, their proposed solution is simple because "SSD uses only temperature-shifted samples from the base model and standard cross-entropy training, without privileged context, feedback-conditioned teachers, or auxiliary supervision."
The Apple paper also cites another, very similar self-distillation paper by a UCLA team. Both cited papers, one by the MIT & ETH team and the other by the UCLA team, propose a novel on-policy self-distillation technique. Interestingly, the two teams submitted their papers to arXiv within one day of each other back in January this year [4],[5]. No prize for guessing who actually published the idea first.
IMHO, self-distillation fine-tuning is the future of LLM fine-tuning because it mitigates the forgetting caused by SFT, which makes SFT cumbersome for lightweight fine-tuning as opposed to full post-training of an LLM.
With the advent and proliferation of a plethora of open-source and open-weight LLM foundation models, anyone can fine-tune these models for domain specialization or sub-specialization (medical sub-specialties, law disciplines, branches of architecture practice, etc.) [6]. This fine-tuning can be performed with as little as 8 H200 or even 4 H100 GPUs, as reported respectively in the two papers [4],[5]. Let's see if we can replicate that with much cheaper arrangements consisting of a couple of DGX Sparks, or the latest eight-node DGX Spark arrangement with a total of 1 TB RAM (128 GB x 8) [7],[8].
IMHO, if the results are valid, self-distillation could be the second best thing to have happened to LLMs after the transformer.
Very comprehensive, will give this a read, thanks man!
greesil 1 days ago [-]
Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand?
Gemini tells me it's the probability of the next token for an LLM. Okay then.
coderenegade 23 hours ago [-]
The policy is how you select your actions -- in this case, the next token. It can be random, but it doesn't have to be. "Deterministically choose the best action" is a valid policy (we would call it the greedy policy), as long as you have some other means of injecting stochasticity so the model explores the space. Uniform random is also a valid policy, as is always selecting the same token (it obviously wouldn't be very performant, and would defeat the purpose here, but it might be fine in, for example, a multi-armed bandit scenario). Most of the time, the policy is a parameterized distribution, and we want to learn the model parameters that maximize some measure of success (the reward component).
Off-policy versus on-policy refers to what data the model is trained on. On-policy training is where the training data is collected by the policy. Off-policy training is where the data was collected by a different sampling process (e.g. we have a standard dataset that we're going to use for supervised training).
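To make the two paragraphs above concrete, here is a toy sketch: a policy as a function from state to a distribution over actions, greedy versus sampled action selection, and the difference between on-policy and off-policy data. The `LOGITS` table and the tiny vocabulary are made up for illustration.

```python
import math
import random
from typing import Dict, List, Tuple

# Hypothetical per-state scores; a real LLM would compute these from the context.
LOGITS: Dict[str, Dict[str, float]] = {
    "the": {"cat": 2.0, "dog": 1.0, "car": 0.0},
}


def policy(state: str) -> Dict[str, float]:
    """Map a state to a probability distribution over actions (softmax)."""
    scores = LOGITS[state]
    top = max(scores.values())
    exps = {a: math.exp(v - top) for a, v in scores.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def greedy_action(probs: Dict[str, float]) -> str:
    # "Deterministically choose the best action."
    return max(probs, key=probs.get)


def sample_action(probs: Dict[str, float], rng: random.Random) -> str:
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


rng = random.Random(0)
probs = policy("the")

# On-policy data: actions drawn from the policy being trained.
on_policy_data: List[Tuple[str, str]] = [
    ("the", sample_action(probs, rng)) for _ in range(5)
]
# Off-policy data: a fixed dataset produced by some other process.
off_policy_data: List[Tuple[str, str]] = [("the", "car"), ("the", "car")]
```

Here the off-policy dataset keeps choosing "car", which the policy itself rates as the least likely action, so training on it pulls the policy toward behavior it would rarely produce on its own.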
Ifkaluva 23 hours ago [-]
It’s quite common these days to treat an LLM as a policy in the sense that it takes as a “state” the previous context, and its task is to choose a continuation, as an “action”. It gets a “reward” from a reward model that was trained on human preferences, or from a verifiable source, such as passing test cases.
This framing has been active for several years, as it’s the framing that enables RLHF and RLVR. RLHF itself is quite old; I think it dates back to the original ChatGPT.
mountainriver 1 days ago [-]
What is this comment? It’s an RL paper, these are standard RL terms
greesil 1 days ago [-]
It's a comment. On Hacker News. Not the RL subreddit, or whatever. I'm just amazed at the jargon. I'm sure it's useful, but one could just call it model output.
That would be incorrect. My other reply attempts to address this.
greesil 23 hours ago [-]
But the probability vector is the output of the LLM, no?
antonvs 20 hours ago [-]
> But the probability vector is the output of the LLM, no?
In some contexts yes, but that's not actually the policy. As I wrote in my other comment (quoting because I think it's worth highlighting):
> "the policy is a function that, given some context, assigns probabilities to possible next tokens."
In the same sentence, I also incorrectly referred to this as a "probability distribution", but that's not accurate: it's a function that produces a probability distribution. The policy instantiated at a specific context produces a probability distribution.
In fact, you'd be closer to the mark if you called the policy "the model", but the two terms emphasize different aspects - as I said, "policy" views it from an RL perspective. From that perspective, the policy is a function, the model is an implementation of that function.
Besides, "output of the LLM" is ambiguous. It commonly means the actual generated token(s) (or text), not the probability distribution. Depending on context, "output of the LLM" could refer to (1) logits, (2) the probability distribution, (3) a single selected token, (4) the full generated text.
"Policy" has no such ambiguity - it has a precise definition. That's why technical subjects rely on jargon in the first place, but it results in the exact issue we're running into here: "Jargon enables quick and precise communication among insiders, but it is usually confusing or unintelligible to outsiders."
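The four possible meanings of "output of the LLM" listed above can be told apart in a toy example. This is only a sketch: `forward` is a hypothetical stand-in for a model forward pass, and the three-word vocabulary is made up.

```python
import math
import random
from typing import List

VOCAB = ["yes", "no", "maybe"]  # toy vocabulary, hypothetical


def forward(context: List[str]) -> List[float]:
    # Stand-in for a model forward pass: one fixed score per vocabulary entry.
    return [1.0, 0.5, -1.0]


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(context: List[str], rng: random.Random) -> str:
    # Sample one token from the distribution the policy assigns to this context.
    return rng.choices(VOCAB, weights=softmax(forward(context)), k=1)[0]


def generate(context: List[str], steps: int, rng: random.Random) -> str:
    out = list(context)
    for _ in range(steps):
        out.append(next_token(out, rng))
    return " ".join(out[len(context):])


rng = random.Random(0)
logits = forward(["Q:"])          # (1) logits
probs = softmax(logits)           # (2) the probability distribution
token = next_token(["Q:"], rng)   # (3) a single selected token
text = generate(["Q:"], 3, rng)   # (4) the full generated text
```

In this picture the policy is none of the four values: it is the mapping from any context to a distribution, i.e. the composition of `forward` and `softmax` viewed as a function.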
greesil 17 hours ago [-]
Yes, I understand one function of jargon, which can be useful to insiders in that it conveys a precise meaning. But, it can be confusing to outsiders, and that is also a useful thing for insiders. In the context of LLMs, what other function can produce p(next token) if not the LLM? And, you just about make my point for me with regards to jargon being confusing by misidentifying what the policy actually is (something i never would have noticed :) In any case, it's an interesting paper. Thanks for all your down votes everyone.
airstrike 16 hours ago [-]
The LLM is the whole car and policy is a specific part.
antonvs 9 hours ago [-]
> In the context of LLMs, what other function can produce p(next token) if not the LLM?
You're thinking about it from a specific implementation-oriented perspective. Policy is a well-defined theoretical concept that generalizes beyond LLMs - as we've discussed, it comes from RL. If one is discussing the use of RL techniques on LLMs, it can make sense to use well-defined RL terminology.
Here's a definition from Sutton & Barto's RL intro (https://web.stanford.edu/class/psych209/Readings/SuttonBarto...):
> "At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent’s policy and is denoted π_t, where π_t(a|s) is the probability that A_t = a if S_t = s. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent’s goal, roughly speaking, is to maximize the total amount of reward it receives over the long run."
If you apply this definition to an LLM, you find that the model itself becomes the implementation of a policy. But narrowing one's thinking about this to purely thinking about it in terms of what an LLM happens to do to implement a policy is not necessarily a good idea for a researcher.
As Sutton & Barto go on to write:
> "This framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision-making and acting. The actions can be low-level controls, such as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to have lunch or to go to graduate school. Similarly, the states can take a wide variety of forms. [...]"
Referring to this as a policy connects it to a much broader body of work that's highly relevant to the problem being studied.
---
> And, you just about make my point for me with regards to jargon being confusing by misidentifying what the policy actually is (something i never would have noticed :)
It's quite the opposite. The jargon exists to make things precise, so that it becomes easier to identify when some nuance has been accidentally dropped, as in this case. It's bad faith to claim that a mistake in my attempt to simplify things for you proves your point.
> But, it can be confusing to outsiders, and that is also a useful thing for insiders.
You should be careful that you're not using anti-intellectual conspiracy theorizing to justify your refusal to try to understand the purpose of terminology you happen to be unfamiliar with.
greesil 42 minutes ago [-]
But me asking questions in fact is trying to understand, is it not? I ask a stupid simple question with a slightly rude tone, and then I get downvoted by a bunch of pedantic insiders. Although to be fair it appears some are trying to help.
Look dude, every field develops its own terminology. It's not a conspiracy, just an emergent property. But it always makes getting into the field, or understanding what's in the field, much harder than it needs to be.
porridgeraisin 15 hours ago [-]
> some sort of RL thing I'm too ML to understand
Oh boy.
antonvs 23 hours ago [-]
Gemini didn't really say that exactly, did it? Because it's oversimplified to the point of being wrong.
“Policy” here refers to a probability distribution, i.e. a function that, given some context, assigns probabilities to possible next tokens. It's what a model’s behavior looks like when viewed through an RL lens.
The paper discusses “on-policy” and “off-policy” training, which is central to its idea.
Off-policy training is what happens in standard supervised fine-tuning (SFT): the model is trained on examples that were produced independently of the model. This means that the examples have a different distribution than what the model produces. This can have a negative effect on previously learned capabilities.
On-policy training (in this context) uses data generated by the model itself. It samples the model’s own outputs, scores them against whatever results are being trained for, and updates the model based on those scores. This reinforces certain aspects of the model's own pretrained behavior, so is a "gentler" way to change the model's behavior. The authors claim that this reduces "catastrophic forgetting" and other negative consequences of SFT.
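The sample, score, update loop described above can be sketched with plain REINFORCE on a three-armed bandit. This is a stand-in chosen for brevity, not the paper's algorithm; the reward function, learning rate, and step count are arbitrary assumptions.

```python
import math
import random
from typing import Callable, List


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def on_policy_step(
    logits: List[float],
    reward_fn: Callable[[int], float],
    rng: random.Random,
    lr: float = 0.5,
) -> List[float]:
    probs = softmax(logits)
    # 1. Sample the policy's own output.
    action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    # 2. Score it.
    reward = reward_fn(action)
    # 3. Update: the gradient of log pi(action) with respect to logit i
    #    is (1 if i == action else 0) - probs[i].
    return [
        x + lr * reward * ((1.0 if i == action else 0.0) - probs[i])
        for i, x in enumerate(logits)
    ]


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    # Hypothetical task: only action 2 is rewarded.
    logits = on_policy_step(logits, lambda a: 1.0 if a == 2 else 0.0, rng)
final_probs = softmax(logits)
```

Every update starts from something the policy actually produced, so the change is a reweighting of its existing behavior rather than a jump to an outside distribution.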
sinsudo 20 hours ago [-]
Thanks, very good explanation. One question: one could mix both kinds of policies. Are there hybrid policies (with samples from both the inner and outer distributions)? If so, what are they called?
porridgeraisin 15 hours ago [-]
Policies are not of two types. There is just _a_ policy. On- and off- policy are properties of the training process. If you learn a policy using data which was generated using another policy, it is off-policy. If the data was generated using the same policy, it is on-policy. The distinction matters because (very loosely) the nudges that the other policy's data tell you to make are based on the other policy's existing shape, which might be different from your current policy's shape. Typically, an algorithm itself is called off-policy if it does not care about the source of the data. Example: Q-learning. An algorithm is called on-policy if it requires the source of the data to be the policy itself. In practice, you always use a mixture of both, and apply techniques such as importance sampling to mitigate the off-policy data mismatch.
To answer your question, yes, you can use any mixture of data for your training process. Whenever you use off-policy data, depending on your objective, you might have to use some technique to "fix" your updates.
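As one example of such a "fix", here is a small sketch of importance sampling: samples drawn from a behavior policy are reweighted by the ratio of target to behavior probabilities so that the average estimates the expected reward under the target policy. All the numbers and names are made up for illustration.

```python
from typing import Dict, List


def naive_mean(samples: List[str], reward: Dict[str, float]) -> float:
    # Ignores where the data came from.
    return sum(reward[a] for a in samples) / len(samples)


def importance_weighted_mean(
    samples: List[str],
    target_probs: Dict[str, float],
    behavior_probs: Dict[str, float],
    reward: Dict[str, float],
) -> float:
    total = 0.0
    for a in samples:
        # Ratio of how likely the target policy is to pick `a`
        # versus the policy that actually generated the data.
        weight = target_probs[a] / behavior_probs[a]
        total += weight * reward[a]
    return total / len(samples)


behavior = {"a": 0.5, "b": 0.5}   # policy that produced the data
target = {"a": 0.8, "b": 0.2}     # policy we want to evaluate or train
reward = {"a": 1.0, "b": 0.0}
samples = ["a", "b", "a", "b"]    # matches the behavior frequencies exactly

naive = naive_mean(samples, reward)
corrected = importance_weighted_mean(samples, target, behavior, reward)
```

The naive average reflects the behavior policy (0.5), while the reweighted average recovers the target policy's expected reward (0.8). In practice the ratios can become very large when the two policies disagree strongly, which is one reason off-policy corrections are harder than this toy case suggests.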
antonvs 20 hours ago [-]
> “Policy” here refers to a probability distribution, i.e. a function that, given some context, assigns probabilities to possible next tokens.
This should say "...refers to a function that produces a probability distribution." The latter half of the quoted sentence describes it correctly.
[1] Embarrassingly simple self-distillation improves code generation (2026 - 201 comments):
https://news.ycombinator.com/item?id=47637757
[2] Embarrassingly Simple Self-Distillation Improves Code Generation:
https://arxiv.org/abs/2604.01193
[3] Comment on "Embarrassingly simple self-distillation improves code generation":
https://news.ycombinator.com/item?id=47644784
[4] Self-Distillation Enables Continual Learning:
https://arxiv.org/abs/2601.19897
[5] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:
https://arxiv.org/abs/2601.18734
[6] Why domain specific LLMs won't exist: an intuition (2026 - 4 comments):
https://news.ycombinator.com/item?id=47649167
[7] NVIDIA DGX Spark Review The GB10 Machine is so Freaking Cool:
https://www.servethehome.com/nvidia-dgx-spark-review-the-gb1...
[8] BIG AI Cluster Little Power the 8x NVIDIA GB10 Cluster:
https://www.servethehome.com/big-cluster-little-power-the-8x...