With the Qwen3.5 35B A3B at Q4 I've got 200k context running at 62.98 tokens per second on a local RTX 5080 16GB.
danielhanchen 1 day ago [-]
Oh I didn't expect this to be on HN haha - but yes for our new benchmarks for Qwen3.5, we devised a slightly different approach for quantization which we plan to roll out to all new models from now on!
nnx 1 day ago [-]
Can you describe what this slightly different approach is and why it should work on all models?
hedora 1 day ago [-]
Nice! Your stuff ran LLMs extremely well on <$500 boxes (24-32GB RAM) with iGPUs before this update.
I’m eager to try it out, especially if 16GB is viable now.
gundmc 19 hours ago [-]
The 5080 has 16GB of VRAM, not system memory. I don't think you can get 24-32GB of VRAM in a $500 box.
Kayou 2 days ago [-]
Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible; I was always restricting myself to models smaller than the VRAM I had.
Maxious 2 days ago [-]
Yep. These Mixture of Experts models are well suited for paging in only the relevant data for a certain task: https://huggingface.co/blog/moe
There are also experiments with removing or merging experts post-training to shrink models even further: https://bknyaz.github.io/blog/2026/moe/
MoE is not suited for paging because it’s essentially a random expert per token. It only improves throughput because you reduce the memory bandwidth requirements for generating a token since 1/n of the weights are accessed per token (but a different 1/n on each loop).
Now, shrinking them, sure. But I've seen nothing that indicates you can just page weights in and out without cratering your performance like you would with a non-MoE model.
FuckButtons 1 day ago [-]
Not entirely true: it's random access within the relevant subset of experts, and since concepts are clustered you actually have a much higher probability of repeatedly accessing the same subset of experts.
vlovich123 22 hours ago [-]
It's called mixture of experts, but it's not that concepts map cleanly, or even roughly, to different experts. Otherwise you wouldn't get a new expert on every token. You have to remember these were designed to improve throughput in cloud deployments where different GPUs each load an expert. There you precisely want experts to be hit randomly, to improve your GPU utilization rate. I have not heard of anyone training local MoE models to aid sharding.
cagenut 21 hours ago [-]
is there anywhere good to read/follow to get operational clarity on this stuff?
my current system of looking for 1 in 1000 posts on HN or 1 in 100 on r/locallama is tedious.
p1esk 18 hours ago [-]
Ask any of the models to explain this to you
bee_rider 1 day ago [-]
That blog post was super interesting. It is neat that he can select experts and control the routing in the model. Not having played with the models in detail, I tended to assume the "mixing" in mixture of experts was more like a blender, haha. The models are still quite lumpy I guess!
segmondy 2 days ago [-]
llama.cpp is designed for partial offloading: the most important parts of the model are loaded onto the GPU and the rest into system RAM. I run 500B+ models such as DeepSeek/Kimi K2.5/GLM-5 without having that much GPU VRAM.
pyuser583 18 hours ago [-]
How much do you use?
I have lots of trouble figuring out the limits of a system with x amount of VRAM and y amount of RAM. How do you determine this?
fc417fc802 14 hours ago [-]
Ideally you'd have (parameter count) * (bits per parameter) of VRAM for the entire (presumably quantized, don't forget to account for that) model. So, very approximately, 16 GiB for a 34B model quantized to 4 bits per parameter.
You can spill to RAM, in which case you at least want enough VRAM for a single active expert, but really that's going to tank performance. If you're only "a bit" short of the full model, the difference might not be all that large.
These things are memory-bandwidth limited, so if you check out RAM, VRAM, and PCIe bandwidths, what I wrote above should make sense.
Also, you should just ask your friendly local LLM these sorts of questions.
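As a rough sketch of that arithmetic (the ~10% overhead factor for KV cache and runtime buffers is my assumption, not a measured number):

```python
# Back-of-envelope VRAM estimate for a quantized model (illustrative only).
def model_gib(params_billions: float, bits_per_param: float, overhead: float = 1.1) -> float:
    """Weights in GiB: params * bits / 8 bytes, padded ~10% for KV cache and buffers."""
    weight_bytes = params_billions * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 2**30

# A 35B model at ~4.5 bits per weight (Q4_K-style) needs roughly 20 GiB:
print(round(model_gib(35, 4.5), 1))
```

By that estimate, a 16GB card holds the full quantized weights only for models up to roughly the 25B range at 4 bits, which is why spilling expert tensors to system RAM keeps coming up in this thread.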
Koffiepoeder 1 day ago [-]
The A3B part of the name stands for "Active 3B": for inference, a core ~3B parameters are used in conjunction with another subpart of the model, selected based on the task (MoE, mixture of experts). If you use these models mostly for related/similar tasks, that means you can make do with a lot less than the full 35B params in active RAM. These models are therefore also sometimes called sparse models.
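To make "active" concrete, here is a toy sketch of top-k routing (the function and the logit values are hypothetical, not Qwen's actual router): a learned gate scores every expert for each token, and only the top-k experts' weights are read.

```python
import math

def top_k_experts(router_logits, k=2):
    """Pick the k highest-scoring experts; softmax-renormalize weights over just those k."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 8 experts, but this token only touches the weights of experts 1 and 4:
print(top_k_experts([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2))
```

The token's output is then the weighted sum of just those experts' FFN outputs, which is where the small "active" parameter count comes from.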
nurettin 1 day ago [-]
This is why they say "A3B" meaning only 3B is active at a time, limiting VRAM usage.
roxolotl 1 day ago [-]
What method are you using to do that? I’ve been playing with llama.cpp a lot lately and trying to figure out the cleanest options for getting a solid context window on 32gb vram and 64gb system ram.
jychang 1 day ago [-]
32GB of VRAM is more than enough for Qwen3.5 35B.
You can just load the Q4_K_XL model like normal and put all tensors on the GPU, without any -ot or --cpu-moe flags.
If you need a massive context for some reason where model + KV cache won't fit in 32GB, then use -ot to move the FFN MoE experts for 1-2 layers into RAM. You'll take a speed hit (due to loading params from slower RAM instead of fast VRAM), but it'll work.
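A sketch of what that might look like on the command line (the model filename, context sizes, and the tensor-name regex are illustrative; check your llama.cpp build's --help for the exact flag behavior):

```shell
# Everything on the GPU: ~20GB of Q4 weights plus a modest KV cache fits in 32GB VRAM.
llama-server -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -ngl 99 -c 32768

# Very large context: keep shared weights on the GPU, but override the MoE
# expert tensors of the first two layers onto CPU/system RAM.
llama-server -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -ngl 99 -c 200000 \
  -ot "blk\.(0|1)\.ffn_.*_exps\.=CPU"
```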
roxolotl 1 day ago [-]
Nice, ok, I'll play with that. I'm mostly just learning what's possible. Qwen 3.5 35b has been great without any customizations, but it's interesting to learn what the options are.
I believe it's mentioned that MXFP4 performs surprisingly badly; you may want to try other Q4s.
cpburns2009 1 day ago [-]
Does llama.cpp support Qwen3.5 yet? When I tried it before, it failed saying "qwen35moe" is an unsupported architecture.
hnfong 1 day ago [-]
Yes, but make sure you grab the latest llama.cpp release
New model archs usually involve code changes.
sowbug 19 hours ago [-]
If you're running Ollama, you'll have to wait a little longer for its embedded version of llama.cpp to catch up. It can be a couple days or weeks behind.
cpburns2009 1 day ago [-]
Awesome! It looks like the llama.cpp-hip AUR was updated today to b8179, and it works.
reactordev 1 day ago [-]
You would need the Dynamic 2.0 GGUF as discussed in the article.
But mmmmmm, Q8_K_XL looks mighty nice.
RS-232 1 day ago [-]
That’s intriguing. I have the same card, maybe I should give it a go. Curious about your CPU/RAM/storage capacity as well.
Any resources for configuring the local setup?
My entire home media stack is a single compose file in a WSL distro so it would be cool if local LLM worked the same way.
jychang 2 days ago [-]
Not really breakthroughs, more like bugfixes for their broken first batch.
Old 2/24 Q4_K_XL commit (pre-bugfix files): https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/commit/7...
Questions for a postmortem that the blog post left unanswered:
- Why the change? Is it just to improve PPL/KLD? Sure, we can assume PPL and KLD are not perfect benchmarks, but if so, then why change the quantization anyway? Or was the old 2/24 quant actually much worse-performing in the real world? I presume the Q4_K_XL quant using mxfp4 was the issue? If the 2/24 files having a lower PPL is an actual issue due to low-quality tensors, then why not just say that?
- What were the main tensors whose quantizations changed from 2/24 to 2/27? Did you quantize attention tensors differently? Or perhaps ssm?
- What was it changed from? Was it changed from mxfp4 or q4_k to q8, or something else?
A quick sentence in the blog post saying "ok, we've confirmed that using mxfp4 (or q3 or whatever) in the attention/ssm/biases/norms/etc. is a bad idea; we had that in our old models on 2/24 and our new models today are better" would make it clear. As it's written, it's trying to say both "PPL/KLD don't actually reflect real-world quality" and "we changed our quant to improve PPL/KLD" at the same time, which seems contradictory.
zargon 1 day ago [-]
Explain what about that statement is false. Your original Q4_K_XL quant was broken. People noticing that it was a total outlier among other quants is what prompted this "research". Your own data proves that your new release fixes the bugs of your original, in order to match AesSedai's PPL. Fixing bugs is great. Searching for the best quant mix is helpful. I use your quants and appreciate your work. But whitewashing this situation dilutes trust and goodwill.
Archit3ch 1 day ago [-]
What's the verdict for real world use on Q3 120B (fits in 64GB) vs Q4 of a smaller model?
FuckButtons 1 day ago [-]
Bigger model wins as long as the quantization was done properly.
jychang 2 days ago [-]
What's up with this post? It's a link to something which has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?
tosh 2 days ago [-]
Unsloth have just released benchmarks on how their dynamic quants perform for Qwen 3.5: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
I'm aware of that, but that's not the link of the post. The post is linking to their UD 2.0 quants from a few months back.
Also, the benchmarks are because they messed up the first version of their Qwen 3.5 XL quants by quanting some tensors to mxfp4 that should have been in higher quality, and this is their bugfix. The post literally starts out with "We updated Qwen3.5-35B Unsloth Dynamic quants being SOTA on nearly all bits" without explaining WHY they needed to update from the original version.
danielhanchen 1 day ago [-]
Didn't expect this to be on HN haha - but HN does have older posts come up sometimes.
No, your conclusion is false - only the old Q4_K_XL had slightly higher perplexity; all other quants are fine. We uploaded 9TB of research artifacts to https://huggingface.co/unsloth/Qwen3.5-35B-A3B-Experiments-G... for the community.
If you read our blog, it says KLD and PPL are actually sometimes counterintuitive - for example on MiniMax, some of our quants do worse on PPL and KLD vs AesSedai's, but AesSedai's does worse on LiveCodeBench by a lot - see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-3-...
This is because (see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-1-...), although bitwidths are in general monotonic, i.e. q2_k < q3_k < q4_k < q5_k etc., we find KLD and PPL are actually not monotonic, i.e. q3_k can actually have BETTER PPL than q4_k.
So the main point is bad luck in quantization - sometimes lower bits might get lower PPL and KLD, but this is actually a ruse and wrong, since on actual real-world tasks it's worse.
jychang 1 day ago [-]
The Q4_K_XL is easily the most popular quant for the model, though.
So then why was Q4_K_XL having issues? Is it just a PPL issue that doesn't reflect in real-world usage? If yes, why not just say that: "The Q4_K_XL had lower PPL, but don't worry, PPL can be wrong, and other benchmarks show it's fine"? If it was a real quality issue, then what caused it?
The blog post says "Retiring MXFP4 from all GGUF quants: Q2_K_XL, Q3_K_XL and Q4_K_XL, except for pure MXFP4_MOE" but doesn't say why. The easy assumption most people would make is "oh, you quanted attention or ssm or something to mxfp4 and that turned out to be bad, so you retired mxfp4" - but if you say it's not that, then what's the actual issue?
segmondy 1 day ago [-]
Each layer is made up of various weights, and the weights are adjusted to quant it. A pure q8 will have all the weights as q8, and a q4 the same, but some are kept as f32, etc. Here's an example of q3_k_xl: https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF/tree/ma... - we can see certain weights are f32, q8, q5, q3, etc. They used mxfp4 in some weights, and mxfp4 doesn't seem to play nicely in quants, so that's why they are retiring it. Read their publication again and it should make more sense.
jychang 24 hours ago [-]
I am aware of all that.
They literally never say “they used mxfp4 in some weights”. What you’re claiming they said doesn’t exist.
This isn’t a postmortem, it’s PR fluff without actually addressing the issue.
"MXFP4 is much worse on many tensors - attn_gate, attn_q, ssm_beta, ssm_alpha using MXFP4 is not a good idea, and rather Q4_K is better - also MXFP4 uses 4.25 bits per weight, whilst Q4_K uses 4.5 bits per weight. It's better to use Q4_K than MXFP4 when choosing between them."
The Q4 quants had a mixture of mxfp4 leading to worse outcomes.
az226 2 hours ago [-]
I’m curious how NVFP4 compares to their Q4.
lostmsu 2 days ago [-]
Looking at their benchmarks there doesn't appear to be a meaningful difference between their quants and bartowski's quants.
Barely noticeable drop in PPL; noticeable KLD drop (good, 5%); but worse KLD mean (bad, 5%).
danielhanchen 19 hours ago [-]
You forgot to check the disk space - _M and _XL are not the same size across quants:
Unsloth Q4_K_M: 18.49GB | KLD 99.9%: 0.5478 | KLD mean: 0.0192
Unsloth Q4_K_XL: 19.17GB | KLD 99.9%: 0.4097 | KLD mean: 0.0137
bartowski Q4_K_M: 19.77GB | KLD 99.9%: 0.5771 | KLD mean: 0.0182
lostmsu 12 hours ago [-]
The table doesn't have a bartowski Q4_K_XL to compare against, but given the _Ms' metrics aren't universally better, it's unclear whether the smaller size comes without a cost.
danielhanchen 1 day ago [-]
Didn't expect this one to be on HN as well haha - probably related to Qwen3.5
qskousen 1 day ago [-]
This is pretty interesting. Based on the blog post, it seems like they are using a technique similar to what I have been using to generate "layer sensitivity" data in my (still pretty beta) ggufy project, which is more aimed at diffusion (image) models.
https://github.com/qskousen/ggufy
electroglyph 2 days ago [-]
Cheers Daniel and Mike and team, keep up the good work!
danielhanchen 1 day ago [-]
Thank you!
tenpa0000 2 days ago [-]
I run Llama 3.2 3B locally for latency-sensitive classification (sub-50ms, so no room for bigger models). At that scale Q2_K vs Q4_K_M isn't just smaller — Q2 starts flipping yes/no answers that Q4 gets right. Not often, but enough to notice in production.
So the KL divergence numbers here are more useful to me than the MMLU tables honestly. I've had MMLU hold steady while the output distribution drifted enough to break things downstream.
Does the calibration dataset make much difference at 3B though? There's so little redundancy that I'd expect it to hit a floor pretty fast regardless of how good the calibration data is.
am17an 1 day ago [-]
What do you use for sub-50ms inference?
zozbot234 2 days ago [-]
For a simple classification task you generally want to prioritize regularization over more sophisticated behavior, so fewer parameters with larger quantization makes sense. For more generic chat-like purposes, Q2 of a larger model may often be preferable to Q4 of a smaller one.
santa_boy 1 day ago [-]
Great timing. I downloaded the models today on LM Studio, they seem to work remarkably well.
Any HN model recommendations to run on my 24GB M5 and any best practices while running them?
Havoc 2 days ago [-]
Advances in this space are always welcome.
I see the change in KLD values is pretty modest vs the prior version. Does anyone know how that translates to the real world? Is it more of a linear-type situation, or exponential, etc.?
I love the work unsloth is doing. I only wish the GGUF format had better vLLM support. It's sometimes hard to find trustworthy quants that work well with vLLM.
dyl000 2 days ago [-]
So q6 is practically perfect, and q3 is meaningfully decent. Very impressive!
oofbey 16 hours ago [-]
What does it mean to say that “99.9% KL divergence” is some number like 3? In AI research and math, KL divergence is a pseudo-distance metric from one distribution to another. (Not technically a distance between two distributions because it’s asymmetric.)
Folks here who spend lots of time thinking about compressing models apparently have some specific interpretation of the term. Can somebody educate me? Because I only understand the math definition.
oofbey 3 hours ago [-]
The confusing thing is that there are two distributions involved here. There's the distribution over the vocabulary (the possible values of each token) and the distribution over the sequence of tokens in each document.
Here, the KL divergence is calculated over the vocabulary distribution: for a specific token, it measures how much the quantized model's predictions differ from the reference model's. 0 means a perfect match (no loss of quality from quantization), while a large number like 4 nats means the quantized model's predictions for that token differ substantially from the reference model's.
The 99.9% is taken over the sequence of tokens. So it ranks all the tokens in a corpus, and it effectively finds the token with the worst predictions (relative to the reference model) out of every 1000 tokens. That's the 99.9%ile part.
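A sketch of that statistic in code (the function names and toy distributions are mine; this is not llama.cpp's actual implementation):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for one token position's vocabulary distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kld_999(reference, quantized):
    """Per-token KLD between reference and quantized models, then the 99.9th percentile over all positions."""
    klds = sorted(kl_divergence(p, q) for p, q in zip(reference, quantized))
    return klds[min(len(klds) - 1, int(0.999 * len(klds)))]

# 999 positions where the quant matches the reference exactly, one where it doesn't:
ref = [[0.7, 0.3]] * 999 + [[0.9, 0.1]]
quant = [[0.7, 0.3]] * 999 + [[0.5, 0.5]]
print(round(kld_999(ref, quant), 3))
```

So the "KLD 99.9%" columns in the tables upthread report that near-worst-case per-token value, while the "mean" columns average over every token in the corpus.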
raphaelmolly8 1 day ago [-]
[dead]
aichen_dev 2 days ago [-]
[dead]
MarcLore 2 days ago [-]
[dead]
shablulman 2 days ago [-]
[dead]
roolgo 1 day ago [-]
[flagged]
CaptainFever 16 hours ago [-]
Rude.