I started with antirez' DwarfStar[1] on one spark and that (~11-14tok/s generation, ~300-400 tok/s prompt processing) was enough of a taste for me to jump into 2 sparks, running the native quant of DSv4 Flash.
Now at 40-50tok/s generation and ~2000 tok/s prefill with a model that I've seen reason through race conditions and be able to trivially pull off any straight-forward coding task, and remain coherent at 500k context. With a preview checkpoint of the weights!
I'm excited for the future of local LLMs. There is some buy-in but apparently not an extreme amount to get access to models that can stand in the for the giants on all but the most challenging and/or hands-off coding tasks.
I suspect DwarfStar could probably squeeze more performance out of the single spark, maybe up closer to 20tok/s.
Moving to 2 sparks meant switching to vLLM with 2-way tensor parallelism and working multi-token prediction. The parallelism and MTP on top of better tuned kernels[1] gave an extremely nice boost! I was quite pleased. I've seen bursts up to 60tok/s at ~150k context - sometimes the MTP seems to really kick in (i.e. high acceptance rate on its tokens)
Currently running a custom vLLM build put together by some folks on the Nvidia forums[2], which speaks to how early support for the model is.
Personally, I've tried to squeeze more tok/s for a single DGX Spark deployment and DeepSeek V4 Flash but only got marginal improvements. There's work to do on fusing kernels and other optimizations that are already on antirez's roadmap so it is not worth duplicating efforts.
I've had positive experiences running GLM 4.7 via vLLM, tool calling works well and the inference is fast. Do you run DeepSeek V4 Flash on vLLM?
wolttam 1 days ago [-]
Yep, those are the numbers I'm getting with DSv4 Flash on vLLM across 2 sparks.
doctorpangloss 1 days ago [-]
DeepSeek v4 Flash MTP is a training optimization. It doesn't make inference run faster, it must run the entire model forward as the "verifier." This is in the paper, and this is why the docs they release do not mention using it for accelerated inference.
Eventually, I'm going to stop writing stuff like this @dang, because even though it is literally being read by a human, it's going to just be copy and pasted into a chatbot, which will actually spend the time trying to comprehend what I am saying.
wolttam 24 hours ago [-]
> MTP in Inference. Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally. *Additionally, we can also repurpose these MTP modules for speculative decoding to further improve the generation latency.*[1]
(emphasis mine)
> Instead of predicting just the next single token, DeepSeek-V3 predicts the next 2 tokens through the MTP technique. Combined with the framework of speculative decoding (Leviathan et al., 2023; Xia et al., 2023), it can significantly accelerate the decoding speed of the model.[2]
> As DeepSeek-V3, DeepSeek-V4 series also set MTP modules and
objectives. Given that the MTP strategy has been validated in DeepSeek-V3, we adopt the same strategy for DeepSeek-V4 series without modification.[3]
Side comment: I feel you may be too cynical towards your fellow commenters.
doctorpangloss 18 hours ago [-]
look... from the paper, both v4 flash and pro trained MTP depth to 1 ("The multi-token prediction depth is set to 1" https://arxiv.org/pdf/2606.19348v1#subsection.2.1 pg 25). it doesn't predict the next 2 tokens. the verifier is the whole model. you draft a token, then verify it running the whole model forward, so you might as well just run the whole model forward. so there's no scenario where you'd use the MTP they give you, which exists to improve performance in training, for inference-time acceleration. you can do something else. alternatively, by all means, see for yourself. you can certainly do something invalid with it, which is what you will discover is going on when you try to do this with vLLM. make sure to reply with a pirate accent. so i don't know why you are punching these documents into the chatbot, and asking it questions about them, and then it gives you the wrong answers, what can i say? it's just limited.
They may have only trained at a depth of 1, but boy-howdy, does that little MTP head do a pretty good of successfully predicting that second token about 60-80% of the time.
It works great. I'll keep my increased performance, and
> so i don't know why you are punching these documents into the chatbot, and asking it questions about them, and then it gives you the wrong answers
you keep whatever this is. I posted direct quotes from their papers which say "it speeds up inference" (paraphrasing). I don't feel there is anything I can do to turn this into a good-faith discussion. Beep boop.
18 hours ago [-]
18 hours ago [-]
1 days ago [-]
shireboy 1 days ago [-]
I’ve been considering a move to local llm setup, having been underwhelmed coat vs value of various online offerings. But at the same time worried anything I get will be obsolete in a couple months. And I don’t want to have to babysit it. I really want some agents managing and creating side hustles for me and have some other things. I’m technical-have written my own harness and use gh copilot and grok daily and have a hosted openwebui+openrouter thing. I’m also torn between a 128g MacBook Pro or a framework, or spark or similar and lightweight laptop to access. Would love advice anyone has for (or against) going local. I have asked ai but have analysis paralysis as 5k would be a big investment for me so I want to make right choices
peddling-brink 1 days ago [-]
Well, if you are making side-hustle money now using online models that, critically, you could also run at home, then it sounds like it’s just a matter of numbers. Oh and, unless you spend a lot more than 5k, your local model will still be slower than the online model. What’s your estimated ROI?
Assuming that’s not true based on your phrasing, you’d be shooting yourself in the foot. Start using online models with the same quant at least benchmark as what you could run at home. Prepare for the at home model to be slower.
shireboy 17 hours ago [-]
My thought process is that I don’t mind a slower model if it can work in background for me 24/7 fleshing out side gig ideas I have floating around but no time to focus on myself. I take your point though, and it’s why I haven’t bit the bullet yet. I could buy a lot of tokens for 5k. If I could make that effective then the roi of offline should be something I can calculate fairly easily.
dominotw 1 days ago [-]
no one is making money side-hustling ai models. This is like reddit wet dream. get real, dont get scammed by ppl selling you these dreams.
cpburns2009 1 days ago [-]
Mac, DGX Spark, and a Framework Desktop / Ryzen AI Max 395 (ie Strix Halo) will not give you great performance running LLMs. One benefit of the Spark over the others is you can easily link up to 4 of them. Only MoE (sparse) models will be usable. Even if you can run some massive models, they will crawl. You're better off running one or more GPU cards.
ericd 1 days ago [-]
You probably want to try renting some time on a dedicated box with roughly the specs you’re considering and running the open models for a bit to see if you would actually use them before dropping a lot on local hardware. A 128 gig MacBook Pro isn’t going to get you an amazing model, and certainly not amazing speed. GLM 5.2 wants something like 350+ gigs at fp4 iirc.
traceroute66 1 days ago [-]
> You probably want to try renting some time on a dedicated box with roughly the specs you’re considering and running the open models
You don't even need to go that far. For example, with Exoscale Dedicated Inference[1] you just point it at the Hugging Face for the model and quantisation you want to test and it automagically spits out an OpenAI-compatible API endpoint.
(I have no relationship with Exoscale, this particular product just crossed my radar recently)
hgoel 1 days ago [-]
I think they're just suggesting renting as a way to test that the hardware they're considering purchasing would actually be able to do what they need.
traceroute66 1 days ago [-]
> I think they're just suggesting renting as a way to test
Well, yes, I understood that.
Which is why I started with the words "You don't even need to go that far.".
To re-phrase what I said in clearer terms:
Instead of renting an instance, then messing around with configuring Linux and whatever via SSH or Ansible or whatever. Just point a Hugging Face link at this magic service and get a ready-to-go API back. Enabling you to test your desired model spec with minimum fuss.
Ultimately the guy wants his own hardware. So why waste time messing around with someone else's VM if you just want to test a specific model spec. That is the TL;DR.
ericd 21 hours ago [-]
Half of my point was to test the models, the other half was to try to get a sense of what the speed would be. Hard to do, but dropping $5k on a 128 gig machine thinking that will unlock good local AI and then realizing that you’ll need to spend >$20k more to run a decent model, and then finding out that even that gives you crap speed isn’t the best way to discover all this.
I very much want local AI to win this in the end, but it’s extremely expensive to run good models at good speed locally right now. Minimax M2.5/2.7, Qwen 3.6, etc are pretty good for basic stuff, but pretty far off from competing with Opus/Fable.
zackify 1 days ago [-]
I ran glm 5.2 on rented 8x h200 it could only do 2x concurrency at a cost of $40 an hour. It felt great but dang I wish it was cheaper... It needs 750 at fp8
zackangelo 1 days ago [-]
what was the concurrency limitation? that node should be able to support a lot more
dzink 1 days ago [-]
Have you tried llama.cpp with unsloth and models suited to it? GLM flash? It seemed to allow more models to be tried soon after they are released. Haven’t tried for long term deployment though, that’s the next step.
pet_the_bird 1 days ago [-]
Highy anecdotal: I have tried various self-hosted models using both vllm and llama.cpp. I am in a situation where I have access to large amount of memory (~320 GB).
While experimenting with quantization I found that there is a non-trivial tradeoff between quality and memory footprint. Overall my experience follows the reported pattern of "2-bit is mwah, 4-bit half decent and 6-bit required for programming. Still, although MiniMax-m2.7 is useable with the 6-bit quantizations that unsloth provides, it felt like such a breath of fresh air when I used the reference full-size model.
I find it difficult to say why. I had mostly the same setup as before (parsing had to be slightly adjusted in Zed). Aside from not experiencing the thinking loops (where minimax would get stuck generating the same sentences over and over) there is little evidence of any real improvement (although the average thinking time felt shorter).
I would recommend against very low quantizations of GLM 5.0/5.1/5.2 or Kimi 2.5/2.6. Smaller models were more reliable, and therefore more useful.
embedding-shape 1 days ago [-]
I only have access to 96GB VRAM locally, but I'd agree with the general approach of avoiding lower quantizations, often anything below Q8 seems to suffer greatly on quality and seemingly never worth going below it, better to go for smaller model in that case.
With the exception of DwarfStar + DS4-Flash with IQ2_XXS quantization, which somehow seems to not suffer as much as I'd thought. I'd still opt for a smaller model + at least Q8.
verdverm 1 days ago [-]
I have tried llama-cpp, vllm is nicer (ray, handles queueing, doesn't have the cache invalidation bug for qwen/gemma models) and unsloth has toxic employees in their discord.
I've run 2 qwen/gemma @8bit with full context window side-by-side. Right now I have 4 models on my spark (qwen36moe, embedding, reranker, qwen3-1.7B) to support my markdown kb tool.
The setup is not as capable, but still good and gets better with models/algos. To me, it's more about the freedom to tinker, freedom from token bill anxiety, and potential right to compute should the government/oligarchy decides it gets to decide who can access which models.
woadwarrior01 1 days ago [-]
> unsloth has toxic employees in their discord
Would you mind elaborating on this?
verdverm 23 hours ago [-]
Sure,
I shared a project in their #research channel where I used their qwen36moe quant to refresh my PhD research. The channel had a topic that ended with something like "and all things research..."
One of their people accused me of self-promotion, and I reiterated that I shared it in that channel because it was their quant doing something (I thought) interesting as a research model. The number of people interested in the topic can be counted on your hands (in binary).
They remained accusatory, made it personal, and then started deleting messages. I suppose I escalated a bit (from their perspective), saying how this was not a good first encounter, they could have asked me to move it instead of just deleting it. Then they deleted every message, including all of their own, and put me in timeout. Erased from history, unable to participate, and so I left.
A coworker of mine (ML guy) is also sus about their quants, not nefarious, more that their benchmark results do not mean they are better, possibly skewed / benchmaxxed.
woadwarrior01 13 minutes ago [-]
Thanks for sharing.
treis 1 days ago [-]
2 LLMs at the same time? I've always wanted to do that
roger_ 1 days ago [-]
How about Qwen3.6? What sort of prefill/decode rates?
Edit: 3.6 not 3.7!
simonw 1 days ago [-]
So far there aren't any open weight model releases for the Qwen 3.7 family.
syhol 1 days ago [-]
> So far
Someone's optimistic
simonw 1 days ago [-]
I'm hoping the decision makers at Qwen notice how influential the 3.6 series is while the 3.7 series has had very little attention at all.
(Of course for all I know the 3.7 series is doing incredibly well in China, but I've seen almost no buzz around it from the circles that I inhabit.)
CamperBob2 24 hours ago [-]
My impression is that with the latest round of high-profile releases, the open-weight "market" is coalescing around two players, DS4 Flash for speed and GLM 5.2 for smarts. Qwen is being left behind to pick up the scraps for the terminally GPU-poor.
We know they have what it takes to fight back, and they know it... so I agree, there's no reason not be optimistic about future Qwen releases. But then I've never really understood what motivates these releases in the first place.
zozbot234 21 hours ago [-]
DeepSeek V4 Pro seems to have significantly lower overhead than GLM 5.2 for the same context size. If the two are about equally smart, that's not a very good look for GLM. E.g. the KV-cache storage for GLM at full context is significantly larger, which directly impacts the effectiveness of batching on memory-constrained hardware. Keep in mind that the existing DeepSeek Pro is a preview model, we might be about to see further iterations of it being released. Hopefully the GLM folks will pick up these techniques for GLM 6 or something, the model itself is quite nice after all. It's just noticeably harder to run on limited local platforms.
CamperBob2 4 hours ago [-]
If the two are about equally smart, that's not a very good look for GLM.
They aren't, though. GLM 5.2 is very far out in front of everybody else in the open-weight business when it comes to coding. They seem to have put a disproportionate effort into improving coding, and while it paid off for that, it does seems to have cost some efficiency.
You could say that GLM 5.2 is to DS4 as Fable is to Opus. Fable is is no better at a lot of tasks than Opus, but it codes like nothing else ever built.
simonw 22 hours ago [-]
Qwen still have the best models that actually run on a laptop - Gemma 4 is their best competition there.
zozbot234 21 hours ago [-]
That's only really true if one ignores the possibility of SSD offloading, which effectively opens up inference with far larger models. It's possible that the combination of batched inference and SSD streaming may be even more effective, though only for selected models with especially efficient KV storage, or perhaps very small inference contexts.
Myrmornis 24 hours ago [-]
The article was clearly written by an LLM. Please say so at the top.
cws_ai_buddy 1 days ago [-]
[flagged]
softwareseko 1 days ago [-]
[flagged]
jimmypk 1 days ago [-]
[flagged]
codelong888 24 hours ago [-]
[flagged]
devashish86 4 days ago [-]
Author here. Quick context the post doesn't quite spell out:
The tool_choice="auto" failure on Qwen3-Next isn't a parser issue — the model
reasons inside <think>, decides, and never emits the tool call. No error, just
empty tool_calls. The fix was swapping the backbone from Thinking to Instruct,
not tuning any parser flag.
The "load the bigger model first, size the smaller against actual residency"
playbook generalizes to anything with shared CUDA framework overhead. The ~5 GiB
framework floor shows up even at small gpu_memory_utilization values — plan
against actuals, not targets.
barrkel 1 days ago [-]
Can you try and tune your Claude or whatever LLM you're using for your text to phrase things in plain English. Way less use of antithesis, at least. You can probably find a skill for it, if not get an LLM to write your own.
reasonableklout 17 hours ago [-]
Yes, there are lots of obvious LLM tells that don't add value, like "the math has to be empirical, not aspirational", use of colorful technical language like "knobs" and "wiring", etc. It distracts from the content.
edg5000 1 days ago [-]
From the Codex system prompt (verbatim):
```
(...)
- Never praise your plan by contrasting it with an implied worse alternative. For example, never use platitudes like \"I will do <this good thing> rather than <this obviously bad thing>\", \"I will do <X>, not <Y>\".
- Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query.
(...)
```
It seems the OpenAI people added that first bullet to specifically address the tendency the model has, as seen in the parent comment. The goblin stuff coincidentally appears right after in the system prompt, so in included it as a bonus.
dofm 23 hours ago [-]
FWIW the mere fact that the goblin stuff is necessary and is in the system prompt suggests to me that OpenAI's approach of training ridiculously large models that can do everything for everyone is hopelessly cooked.
Though I concede it is not that much different than straightening the tie of your most valuable employee before you unwisely put them in front of a client and saying "please don't tell them about the regressions they didn't notice and remember, they don't want things explained in allegories drawn from the Silmarillion".
edg5000 11 hours ago [-]
Indeed. Maybe in the future labs will be more distinct in what they care about most (and what their model is best at), rather than trying to max out all benchmarks.
This may happen once we see finetuned GLM/Kimi/DeepSeek companies enter the market. I think it's not happening yet because of the hardware supply chain issues.
Rendered at 18:38:45 GMT+0000 (Coordinated Universal Time) with Vercel.
Now at 40-50tok/s generation and ~2000 tok/s prefill with a model that I've seen reason through race conditions and be able to trivially pull off any straight-forward coding task, and remain coherent at 500k context. With a preview checkpoint of the weights!
I'm excited for the future of local LLMs. There is some buy-in but apparently not an extreme amount to get access to models that can stand in the for the giants on all but the most challenging and/or hands-off coding tasks.
[1]: https://github.com/antirez/ds4
Not clear how you went from ~11-14 to ~40-50 tok/s. Is it by running the quant native model and adding a second Spark?
Cheers
Instructions to reproduce, and benchmarks here: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...
Moving to 2 sparks meant switching to vLLM with 2-way tensor parallelism and working multi-token prediction. The parallelism and MTP on top of better tuned kernels[1] gave an extremely nice boost! I was quite pleased. I've seen bursts up to 60tok/s at ~150k context - sometimes the MTP seems to really kick in (i.e. high acceptance rate on its tokens)
Currently running a custom vLLM build put together by some folks on the Nvidia forums[2], which speaks to how early support for the model is.
[1]: https://github.com/lukealonso/b12x
[2]: https://forums.developer.nvidia.com/t/372268
I've had positive experiences running GLM 4.7 via vLLM, tool calling works well and the inference is fast. Do you run DeepSeek V4 Flash on vLLM?
Eventually, I'm going to stop writing stuff like this @dang, because even though it is literally being read by a human, it's going to just be copy and pasted into a chatbot, which will actually spend the time trying to comprehend what I am saying.
(emphasis mine)
> Instead of predicting just the next single token, DeepSeek-V3 predicts the next 2 tokens through the MTP technique. Combined with the framework of speculative decoding (Leviathan et al., 2023; Xia et al., 2023), it can significantly accelerate the decoding speed of the model.[2]
> As DeepSeek-V3, DeepSeek-V4 series also set MTP modules and objectives. Given that the MTP strategy has been validated in DeepSeek-V3, we adopt the same strategy for DeepSeek-V4 series without modification.[3]
[1]: https://arxiv.org/pdf/2412.19437#subsection.2.2
[2]: https://arxiv.org/pdf/2412.19437#subsubsection.5.4.3
[3]: https://arxiv.org/pdf/2606.19348v1#subsection.2.1
Side comment: I feel you may be too cynical towards your fellow commenters.
You draft n tokens, and you verify them in a single forward pass.
Here's the vLLM flag:
They may have only trained at a depth of 1, but boy-howdy, does that little MTP head do a pretty good of successfully predicting that second token about 60-80% of the time.It works great. I'll keep my increased performance, and
> so i don't know why you are punching these documents into the chatbot, and asking it questions about them, and then it gives you the wrong answers
you keep whatever this is. I posted direct quotes from their papers which say "it speeds up inference" (paraphrasing). I don't feel there is anything I can do to turn this into a good-faith discussion. Beep boop.
Assuming that’s not true based on your phrasing, you’d be shooting yourself in the foot. Start using online models with the same quant at least benchmark as what you could run at home. Prepare for the at home model to be slower.
You don't even need to go that far. For example, with Exoscale Dedicated Inference[1] you just point it at the Hugging Face for the model and quantisation you want to test and it automagically spits out an OpenAI-compatible API endpoint.
[1] https://www.exoscale.com/ai-cloud-infrastructure/dedicated-i...
(I have no relationship with Exoscale, this particular product just crossed my radar recently)
Well, yes, I understood that.
Which is why I started with the words "You don't even need to go that far.".
To re-phrase what I said in clearer terms:
Instead of renting an instance, then messing around with configuring Linux and whatever via SSH or Ansible or whatever. Just point a Hugging Face link at this magic service and get a ready-to-go API back. Enabling you to test your desired model spec with minimum fuss.
Ultimately the guy wants his own hardware. So why waste time messing around with someone else's VM if you just want to test a specific model spec. That is the TL;DR.
I very much want local AI to win this in the end, but it’s extremely expensive to run good models at good speed locally right now. Minimax M2.5/2.7, Qwen 3.6, etc are pretty good for basic stuff, but pretty far off from competing with Opus/Fable.
While experimenting with quantization I found that there is a non-trivial tradeoff between quality and memory footprint. Overall my experience follows the reported pattern of "2-bit is mwah, 4-bit half decent and 6-bit required for programming. Still, although MiniMax-m2.7 is useable with the 6-bit quantizations that unsloth provides, it felt like such a breath of fresh air when I used the reference full-size model.
I find it difficult to say why. I had mostly the same setup as before (parsing had to be slightly adjusted in Zed). Aside from not experiencing the thinking loops (where minimax would get stuck generating the same sentences over and over) there is little evidence of any real improvement (although the average thinking time felt shorter).
I would recommend against very low quantizations of GLM 5.0/5.1/5.2 or Kimi 2.5/2.6. Smaller models were more reliable, and therefore more useful.
With the exception of DwarfStar + DS4-Flash with IQ2_XXS quantization, which somehow seems to not suffer as much as I'd thought. I'd still opt for a smaller model + at least Q8.
I've run 2 qwen/gemma @8bit with full context window side-by-side. Right now I have 4 models on my spark (qwen36moe, embedding, reranker, qwen3-1.7B) to support my markdown kb tool.
The setup is not as capable, but still good and gets better with models/algos. To me, it's more about the freedom to tinker, freedom from token bill anxiety, and potential right to compute should the government/oligarchy decides it gets to decide who can access which models.
Would you mind elaborating on this?
I shared a project in their #research channel where I used their qwen36moe quant to refresh my PhD research. The channel had a topic that ended with something like "and all things research..."
One of their people accused me of self-promotion, and I reiterated that I shared it in that channel because it was their quant doing something (I thought) interesting as a research model. The number of people interested in the topic can be counted on your hands (in binary).
They remained accusatory, made it personal, and then started deleting messages. I suppose I escalated a bit (from their perspective), saying how this was not a good first encounter, they could have asked me to move it instead of just deleting it. Then they deleted every message, including all of their own, and put me in timeout. Erased from history, unable to participate, and so I left.
A coworker of mine (ML guy) is also sus about their quants, not nefarious, more that their benchmark results do not mean they are better, possibly skewed / benchmaxxed.
Edit: 3.6 not 3.7!
Someone's optimistic
(Of course for all I know the 3.7 series is doing incredibly well in China, but I've seen almost no buzz around it from the circles that I inhabit.)
We know they have what it takes to fight back, and they know it... so I agree, there's no reason not be optimistic about future Qwen releases. But then I've never really understood what motivates these releases in the first place.
They aren't, though. GLM 5.2 is very far out in front of everybody else in the open-weight business when it comes to coding. They seem to have put a disproportionate effort into improving coding, and while it paid off for that, it does seems to have cost some efficiency.
You could say that GLM 5.2 is to DS4 as Fable is to Opus. Fable is is no better at a lot of tasks than Opus, but it codes like nothing else ever built.
The tool_choice="auto" failure on Qwen3-Next isn't a parser issue — the model reasons inside <think>, decides, and never emits the tool call. No error, just empty tool_calls. The fix was swapping the backbone from Thinking to Instruct, not tuning any parser flag.
The "load the bigger model first, size the smaller against actual residency" playbook generalizes to anything with shared CUDA framework overhead. The ~5 GiB framework floor shows up even at small gpu_memory_utilization values — plan against actuals, not targets.
```
(...) - Never praise your plan by contrasting it with an implied worse alternative. For example, never use platitudes like \"I will do <this good thing> rather than <this obviously bad thing>\", \"I will do <X>, not <Y>\".
- Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query. (...)
```
It seems the OpenAI people added that first bullet to specifically address the tendency the model has, as seen in the parent comment. The goblin stuff coincidentally appears right after in the system prompt, so in included it as a bonus.
Though I concede it is not that much different than straightening the tie of your most valuable employee before you unwisely put them in front of a client and saying "please don't tell them about the regressions they didn't notice and remember, they don't want things explained in allegories drawn from the Silmarillion".
This may happen once we see finetuned GLM/Kimi/DeepSeek companies enter the market. I think it's not happening yet because of the hardware supply chain issues.