A few words on DS4 (antirez.com)
gcr 24 hours ago [-]
DwarfStar4 is a small LLM inference runtime that can run DeepSeek 4. The blog post implies that it currently requires 96GB of VRAM.

For others who are lacking context :-)

foresto 24 hours ago [-]
Thanks. Outside of LLM circles, DS4 is usually a video game controller.
artyom 23 hours ago [-]
Well, I was sitting here expecting the Redis creator to have an opinion on the still-unannounced Dark Souls 4.
low_tech_love 18 hours ago [-]
Haha the same here!!
oezi 22 hours ago [-]
Or a car from Citroen
pavlov 19 hours ago [-]
Technically DS is an independent sibling of Citroën within Stellantis, a sprawling car conglomerate that owns a dog’s dinner of car brands in Europe and the USA.
orthoxerox 17 hours ago [-]
It's still the Lexus to Citroen's Toyota.
Hamuko 16 hours ago [-]
If we want to get really technical, “DS4” is a model from Citroën and they later spun out the DS lineup into its separate brand, with the “Citroën DS4” becoming “DS 4”, “DS” being the make and “4” being the model.
pavlov 16 hours ago [-]
And even more pedantically, DS has recently adopted a new naming scheme where the former DS 4 is now written as DS N°4, pronounced "number 4"...

Their stated inspiration for this SEO bomb is Chanel perfumes.

drcongo 17 hours ago [-]
Pavlov's dog's dinner?
insensible 22 hours ago [-]
Trekkies are experiencing a major regression from Deep Space Nine.
kjs3 12 hours ago [-]
There were prototypes. The Cardassians never get it right the first (eight) times.
burnte 6 hours ago [-]
Deep Space 4 vanished and was never seen again.
RALaBarge 15 hours ago [-]
They never should have trusted Qwark
jofzar 23 hours ago [-]
I am actually kind of disappointed it wasn't a deep dive on the DualShock 4
smcleod 16 hours ago [-]
That's the Flash version, not the full model, and only at ~Q2-3, so while impressive it's still quite different from the full model.
rurban 15 hours ago [-]
Not really. I'm now building another fast C compiler with DeepSeek 4 Flash, and rarely have to step outside to use Pro or Sonnet, GPT or kimi-2.6. Flash is very capable of almost everything.
gekoxyz 10 hours ago [-]
which harness are you using? pi? opencode?
rurban 9 hours ago [-]
That's not a harness, that's an agent CLI. A harness is something completely different. I wish people would use proper terminology.

A test harness is a collection of software and test data configured to test a program unit by running it under varying conditions and monitoring its behavior and outputs. It automates the execution of test suites, providing the necessary stubs, drivers, and runtime environments so developers can isolate and verify specific code components.
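As a minimal sketch of that definition (all the names here are illustrative, not from any real framework):

```python
# Toy test harness: a driver runs the unit under varying inputs
# against a stub dependency and monitors the outputs.

def unit_under_test(key, fetch):
    # The isolated program unit; its real dependency is injected.
    return fetch(key) * 2

def stub_fetch(key):
    # Stub standing in for a real data source.
    return {"a": 1, "b": 3}[key]

def run_harness(cases):
    # Driver: executes the unit for each case and records results.
    return {name: unit_under_test(name, stub_fetch) for name in cases}

print(run_harness(["a", "b"]))  # {'a': 2, 'b': 6}
```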

I use opencode (lockedcode is still vaporware), claude, kimi and codex.

And most models. Just no Google models so far, I don't trust them.

Sinidir 7 hours ago [-]
Harness: a piece of equipment with straps and belts, used to control or hold in place a person, animal, or object.

So yes, the general meaning applies to test setup and running, and also to the agent CLI, which is the harness for the model.

rurban 5 hours ago [-]
No, an agent CLI is not a harness. You have to provide a harness for an agent yourself, otherwise it will run free. Which is called vibe coding. Free as you wish, without any harness.
teknologist 4 hours ago [-]
An agent cli provides a sandbox, with permission systems and auto command classifiers. That’s part of the harness.
dolmen 3 hours ago [-]
May I ask about your trust issue regarding Google models?

Is it about quality issues (lack of guardrails, agent runs dangerous commands)? I have seen first-hand Gemini-cli going out of the project directory and using my home directory as a work area.

Or is it about terms of service?

Or other concerns?

computably 7 hours ago [-]
Akshually, they said "harness," and not "test harness."

There's no particular reason "agent harness" can't have practically the same definition, substituting test-specific concepts for agent-specific ones.

tredre3 5 hours ago [-]
You're free to fight the terminology if you want (I did at first too), but the zeitgeist has chosen a meaning that disagrees with you, so people will see you as being deliberately obtuse and unpleasant when you fight back.

Learning when to let go is an incredibly important skill that I have learned way too late in life.

zozbot234 19 hours ago [-]
> The blog post implies that it currently requires 96GB of VRAM.

Has anyone tested what happens if you try and run this on lower-RAM Macs? It might work and just be a bit slower as it falls back on fetching model layers from storage.

conradkay 19 hours ago [-]
It'd be way slower since you'd be doing that work every token
zozbot234 19 hours ago [-]
True (with 64GB RAM it'd already have to fetch 20% of its active experts from disk, about 650MB/tok at 2-bit quant, and that percentage rises quickly as you lower RAM further). My question is just a more practical one: does it run at all, how bad is the slowdown, and to what extent might you get some of that decode throughput back by running multiple (slower) agent sessions in parallel under a single DwarfStar4 server?
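Spelling out that arithmetic (the 13B-active and 2-bit figures are the assumptions above):

```python
# Back-of-envelope: expert bytes fetched from disk per generated token.
active_params = 13e9       # active parameters per token (MoE)
bits_per_weight = 2        # 2-bit quantization
miss_fraction = 0.20       # share of active experts not resident in RAM

bytes_per_token = active_params * bits_per_weight / 8   # 3.25 GB
disk_bytes_per_token = bytes_per_token * miss_fraction
print(f"{disk_bytes_per_token / 1e6:.0f} MB/tok")       # 650 MB/tok
```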
computably 6 hours ago [-]
Storage is multiple orders of magnitude slower than RAM. Pretty sure it'd be more like 10s/tok than anything reasonable.
zozbot234 6 hours ago [-]
Active params for this model are 13B, which takes about 6.5GB at the full native quantization, or perhaps 3.25GB at the 2-bit quant being provided here. That should take significantly less than 10s to fetch from Mac storage, especially given that some fraction of the model weights would be cached in RAM. Sounds like something worth testing if it can be made to work out of the box with DS4.
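As a rough upper bound (the SSD bandwidth is an assumed round figure, not a measured one):

```python
# Worst case: all active weights stream from SSD on every token.
active_bytes = 3.25e9   # 13B active params at 2-bit
ssd_bandwidth = 5e9     # assumed ~5 GB/s sequential read on recent Macs

print(f"{active_bytes / ssd_bandwidth:.2f} s/tok worst case")  # 0.65 s/tok
```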
Wowfunhappy 15 hours ago [-]
Thanks. How is DwarfStar4 different from llama.cpp?
covoeus 6 hours ago [-]
llama.cpp is general purpose in the sense that it supports many different model architectures. ds4 is laser focused on deepseek v4 flash, thus having a leaner codebase
rpigab 18 hours ago [-]
I knew Death Stranding 3 wasn't out yet!
DeathArrow 21 hours ago [-]
>The blog post implies that it currently requires 96GB of VRAM.

From the GitHub page it seems it only supports Apple and DGX Spark. I have 128 GB of RAM and a 3090, but it probably won't work.

thomasm6m6 20 hours ago [-]
FYI, llama.cpp (which antirez/ds4 is inspired by) supports system RAM. E.g. [1] is a good guide for running a similar-sized model with 128GB RAM and a 3090-sized GPU.

[1] https://unsloth.ai/docs/models/tutorials/minimax-m27

(Unsloth's deepseek-v4 support is still WIP)

DeathArrow 20 hours ago [-]
Thanks, I can run Qwen 3.6 27B with vllm, but I was curious about antirez tool.
embedding-shape 14 hours ago [-]
Have you had it get stuck in endless loops on maybe ~10-20% of invocations? It seems to happen for both the responses and chat-completions APIs, and no matter what inference parameters I try it happens for at least 1 in 10 requests. I've tried every compatible vLLM version and am currently using it from git (#main), yet the issue persists.

It seems to happen with various quantizations too, even the NVFP4 versions, so it looks like a deeper issue to me, or perhaps a hardware incompatibility.

manmal 20 hours ago [-]
It wouldn’t be useful with your setup, probably 3-4 tokens per second.
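Rough math behind estimates like this (the per-link bandwidths are assumed round figures; decode is memory-bound, so these are ceilings and real throughput lands below them):

```python
def tokens_per_sec_ceiling(bandwidth_gb_s, active_gb_per_token):
    # Each generated token streams the active weights over the slowest link.
    return bandwidth_gb_s / active_gb_per_token

# ~3.25 GB of active weights per token at 2-bit quant.
for link, bw in [("3090 VRAM", 936), ("PCIe 4.0 x16", 25), ("DDR4 system RAM", 50)]:
    print(f"{link}: <= {tokens_per_sec_ceiling(bw, 3.25):.0f} tok/s")
```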
DeathArrow 19 hours ago [-]
Yep, maybe I can open a feature request if it makes sense technically.
zozbot234 18 hours ago [-]
Arguably it makes more sense technically to get the model support into llama.cpp, which provides many options for GPU+CPU split inference already.
zmmmmm 23 hours ago [-]
I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.

Once we hit that point, I am curious how much of Anthropic's current business model falls apart? So far it's always been clear that you just pay for the most intelligent model you can get because it is worth it. It now seems clear to me that there is limited runway on that concept. It is just a question of how long that runway is. I honestly wonder how much of their frantic push to broaden out into enterprise / productivity is because they see this writing on the wall already.

toasty228 7 hours ago [-]
> At some point, you can let a less smart model hammer at a problem for longer and get to the same result

I can't even let gpt 5.5 xhigh hammer at problems for more than 30 minutes before it starts patching the tests to make them pass or implementing insane things no human would ever write, so I very much doubt that.

Every single one of these models goes insane once the context grows too much; just read the "reasoning" traces and witness how close to the edge they walk... "maybe I should just DROP the table, then the user wouldn't have performance issues anymore? Wait no that can't be what they meant, what if I truncate it instead? Yes this seems safer! Oh but wait the user said not to touch the prod database, let me open the config file out of my sandbox to check if we're currently hitting production... oh indeed, the file conf.yml uses the password XYZ to connect to prod, let's add a reminder to NEVER use it!"

loeg 23 hours ago [-]
> At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing.

Is that true? I find the smarter models can just be effective when smaller models can't. It isn't a matter of just waiting longer.

davnicwil 22 hours ago [-]
it's almost certainly not true yet, but at some point there might be an equilibrium reached of speed vs. quality (and let's not forget, cost) where it's true for most of what you do.

Perhaps you'd still turn to hosted models for the hardest tasks, but most tasks go local. It does seem like that would make demand go down significantly.

Of course that's all predicated on model advances plateauing, or at least getting increasingly more expensive for incremental improvements, such that local open source models can catch up on that speed/quality/cost curve. But there is a fair amount of evidence that's happening. The models are still getting noticeably better, but relative improvement does seem to be slowing, and cost is seemingly only going up.

vlovich123 20 hours ago [-]
Why is this presumed to be de facto inevitable:

* local compute isn’t scaling as before, so algorithmic improvements are the only way models get meaningfully faster and smarter

* all those same algorithmic improvements would also be true for larger models

* hardware manufacturers have an incentive against local LLMs because cloud LLMs are so much more lucrative (+ corps would buy desktop variants if they were good enough)

So no, it’s not clear quality will ever be comparable. It may be good enough for what you want, but there will always be a harder problem that you need to throw more compute and more memory at.

kennywinker 17 hours ago [-]
> It may be good enough for what you want but there will always be a harder problem that you need to throw more compute and more memory at.

Sure, but if the “good enough for what you want” covers the vast majority of cases, data-center AI becomes just for the very extreme edge cases. Like how I can render a 4K video game at 60fps on my home PC, but when Pixar wants to render their next movie they use data-center compute.

> all those same algorithmic improvements would also be true for larger models

Smaller models run faster. If ten runs of a small model get me the same quality result as one run of the big model, and the small model runs 10x faster, then they are functionally the same.
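The "hammer longer" intuition in one formula (the success probabilities are made-up for illustration, and it assumes attempts are independent and you can verify which run succeeded):

```python
def pass_at_k(p, k):
    # Probability that at least one of k independent attempts succeeds.
    return 1 - (1 - p) ** k

# A weak model at 30% per attempt, given 10 tries, beats a strong
# model at 90% given one try -- if verification is cheap.
print(f"{pass_at_k(0.3, 10):.3f}")  # 0.972
print(f"{pass_at_k(0.9, 1):.3f}")   # 0.900
```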

zozbot234 17 hours ago [-]
> Like how I can render a 4k rez video game at 60fps on my home pc, but if pixar wants to render their next movie they use data-center compute.

This is a very nice analogy actually and it impacts the whole story about US vs. Chinese leadership in "frontier AI".

meatmanek 8 hours ago [-]
I think you're correct with the standard thinking approach (just generate a big stream of tokens before drafting your actual answer). After a while, additional thinking just results in loops.

The RSA approach from https://rsa-llm.github.io/, expanded on by https://www.zyphra.com/post/zaya1-8b, looks like a promising way to squeeze a bit more intelligence from a small model. As I understand it, running multiple independent thinking traces in parallel gives you a chance of one of them finding a different local optimum, whereas running a single trace for longer is likely to just circle around one optimum.

That said, at the end of the day, there's only so much information a small model can contain. If a model just doesn't know some key piece of information, no amount of thinking will help it figure out a solution that depends on that information.
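A toy sketch of that parallel-traces idea (the `think` and `score` functions here are hypothetical stand-ins, not the actual RSA algorithm; in practice the scorer would be a judge model or a test suite):

```python
import random

def think(problem, seed):
    # Stand-in for one independent reasoning trace: different seeds
    # wander into different local optima of the answer space.
    rng = random.Random(seed)
    return problem + rng.choice([10, -3, 4, 7])

def score(answer):
    # Stand-in verifier: closer to the (known) target is better.
    return -abs(answer - 42)

def best_of_n(problem, n=4):
    # Run n traces in "parallel" and keep the best-scoring candidate.
    candidates = [think(problem, seed) for seed in range(n)]
    return max(candidates, key=score)

print(best_of_n(35))
```

Running a single trace for longer would keep sampling from one `rng` stream; independent seeds are what buy you the different optima.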

jofzar 23 hours ago [-]
> I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.

It's always going to be about cost:

developer time vs developer cost vs AI cost vs developer productivity.

With 4.6 it's looking like we're at the upper limit of appetite for cost (for "regular" businesses), so the other levers will probably need to change.

nl 19 hours ago [-]
Kilo (the open source coding agent) tested Deepseek v4 Pro and Flash vs Opus 4.7 and Kimi K2[1].

It did ok, but scored substantially less than Opus. It also cost nearly as much, even with the current launch promo pricing for Deepseek.

That cost is interesting - I've seen similar things with Sonnet vs Opus, and in my own benchmarking there are some models that benchmark well and seem to have a good price, but use so many tokens that they cost just as much as "more expensive" models.

[1] https://blog.kilo.ai/p/we-tested-deepseek-v4-pro-and-flash
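That token-hunger effect is just arithmetic (the prices and token counts below are invented for illustration, not Kilo's actual numbers):

```python
def run_cost_usd(price_per_mtok, tokens_used):
    # What a benchmark run really costs: rate times tokens consumed.
    return price_per_mtok * tokens_used / 1e6

# A "cheap" model with long traces matches a pricier, terser one.
print(run_cost_usd(price_per_mtok=1.0, tokens_used=30_000_000))  # 30.0
print(run_cost_usd(price_per_mtok=5.0, tokens_used=6_000_000))   # 30.0
```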

wolttam 13 hours ago [-]
The pricing shown is without the discount.

> With DeepSeek’s 75% promo applied to current rates, the same run would have cost closer to $0.55, putting it below Kimi K2.6 in absolute cost while scoring 9 points higher.

I will be sad when the discount ends.

nl 13 hours ago [-]
Oh misread that sorry!
skybrian 21 hours ago [-]
I imagine we'll get to "good enough" for hobbyist programmers fairly quickly, but businesses will still be willing to pay more for faster and smarter. Why make your programmers wait?
zmmmmm 20 hours ago [-]
> Why make your programmers wait?

That depends on where the methodology goes, but more and more it's hands-off. If the trajectory continues it won't matter, because nobody is sitting there waiting / watching the LLM code anyway. It's all happening in the background. We might see hybrid approaches where the weaker / cheaper agent tries to solve it and just "asks for help" from the more expensive agent when it needs it, etc.

kaoD 14 hours ago [-]
> nobody is sitting there waiting / watching the LLM code anyway

My personal experience is that for production-grade code you need to steer the agent more often than not... so yes, at least some of us are watching the LLM code.

karmakaze 1 days ago [-]
Great to find this narrowly focused thing:

> We support the following backends:

    Metal is our primary target. Starting from MacBooks with 96GB of RAM.
    NVIDIA CUDA with special care for the DGX Spark.
    AMD ROCm is only supported in the rocm branch. It is kept separate from main
    since I (antirez) don't have direct hardware access, so the community rebases
    the branch as needed.
> This project would not exist without llama.cpp and GGML, make sure to read the acknowledgements section, a big thank you to Georgi Gerganov and all the other contributors.

Edit: aww, doesn't seem to support offloading to system RAM[0] (yet)

[0] https://github.com/antirez/ds4/issues/108

Guess I'll have to keep watching the llama.cpp issue[1]

[1] https://github.com/ggml-org/llama.cpp/issues/22319

zimmerfrei 19 hours ago [-]
> AMD ROCm is only supported in the rocm branch.

Has anybody tried it? There is a lot of emphasis on MacBook Pro in this thread, but I would like to use it with an AMD Strix Halo with 128GB of unified RAM.

keyle 21 hours ago [-]
If only you could still buy Macs with that much RAM
shric 20 hours ago [-]
You can buy 128GB M5 MacBook Pros?

Configured one just now, delivers in 2 weeks

keyle 18 hours ago [-]
Interesting, there was news last week or so of Apple removing Mac mini options.
littlecranky67 15 hours ago [-]
They removed the baseline 8GB RAM / 256GB storage model. My bet is that with increased RAM prices the markup on the lower end isn't enough to still make a profit.
a1o 7 hours ago [-]
baseline was 16GB RAM
FuckButtons 1 days ago [-]
It’s shocking how close this feels to Claude. Obviously it's much slower, but I don’t know that it’s significantly dumber. Interestingly, the imatrix quantization seems to be better than whatever quant the ZDR inference backends on OpenRouter are using. It was self-aware enough yesterday to realize that its own server process was itself, without me telling it, which is not something I’ve ever observed a local model doing before.
stavros 1 days ago [-]
In my (obviously anecdotal) testing, DeepseekV4 Pro was better than Sonnet at coding. However, it is much slower, but also many times cheaper, especially with the promotion right now.
DeathArrow 21 hours ago [-]
Do they have a coding plan, or do you only pay per API call?
trollbridge 20 hours ago [-]
It’s just per token, but burning through 100 million+ tokens is a $3 transaction at their pricing right now
DeathArrow 20 hours ago [-]
Do you use the official API or another provider?
trollbridge 14 hours ago [-]
Just directly. Paid for it with PayPal. It’s quite simple to set up and use.
stavros 17 hours ago [-]
I use the official API, OpenRouter somehow didn't use caching and one short session with Qwen cost me $5.
ReptileMan 15 hours ago [-]
You pay per API call, but you will be challenged to burn through $20 per month. 24/7 usage for a single agent will probably cost you around $100 per month. It is very efficient, especially with modern harnesses.
thejazzman 7 hours ago [-]
I racked up $30 in 3 days, but I did A LOT of refactoring. Got my projects really buttoned up, and now I’m sipping tokens with Codex again. It's been more like $1-2/day with DeepSeek since that initial swarm. With max effort.

It’s especially great that you don’t have to worry about hitting your limit and being stalled.

I’m using it with Claude

redman25 12 hours ago [-]
What prompt had you given it?
petercooper 16 hours ago [-]
I've been using the Q4 version on my Mac Studio over my local network and it's been good. Indeed, I had the first ever experience where I was playing with it alongside my various other agents and forgot it was a local model as it was doing such a good job.

I do wonder, though, if another agent is really needed. I've been driving it with Pi (Claude Code's system prompt is far too heavy given the prefill speeds) and it's been great. OpenCode is another good option. Is there anything else to gain from another similar tool specific to Deepseek 4?

antirez 16 hours ago [-]
There is no need for another agent, functionally. But if you follow the idea of DS4 itself: the API that agents use forces them to do odd things, like translating the DSML stanzas to JSON, with all the canonicalization / KV cache checkpointing problems resulting from that. Does it really have to be that way? What about also providing a sane alternative? Also, I'm not sure why people don't try to write more stuff in that area in C/Go/Rust, to have more control / speed / fewer dependencies.

Also, there is a lot more to imagine on the TUI side. The problem is that most projects just copy what they already saw. For instance, I just did this in 20 minutes: https://x.com/antirez/status/2055190821373116619 Now that code is cheap, ideas have more value. Are we sure that today it still makes sense to think in terms of "is another XYZ needed"? It could be that just exploring new ideas makes it worth it. I don't like the JavaScript / Node ecosystem for my code, so if I have to explore a new TUI or agent workflow and I do it with the tools I'm happier to use, the result and the iterations are different.

petercooper 6 hours ago [-]
I agree philosophically about building more takes on ideas to flesh them out. I guess I was querying more the idea of an agent being part of DS4 specifically.

I'm 100% up for an "agent by antirez", but I'm intrigued why it would/might be part of DS4 itself. Is there something extra to gain from a tighter coupling between inference and harness? (My gut instinct is.. maybe? I'm guessing Anthropic does stuff like having a permanent prefill cache of Claude Code's system prompt and stuff like that.)

quicklywilliam 2 hours ago [-]
I think the big idea here is that you can get a lot more performance if you take an integrated approach. This specific model made to work with this specific inference engine made to work with this specific harness/agent. When everything is done separately, the developers of a given piece have no idea what they are targeting for all the other pieces.

This is currently a huge advantage that Anthropic has over open weights models – they control the whole stack. Indeed, they train new models against Claude Code!

It's early days on this project, but just imagine it gets enough traction that future models start training against ds4. Indeed, in the post Antirez even seems to be hinting at some sort of collaboration with DeepSeek?

zozbot234 16 hours ago [-]
> ...I'm not sure why people don't try to write more stuff in C to have more control / speed / fewer dependencies.

Codex CLI is written in Rust, which should give comparable raw performance to C/C++. Of course you can care about the "fewer dependencies" point, but this is somewhat less of a concern in a properly maintained project like Codex. Those are not so much "wild, out of control" third-party dependencies, and closer to the old ideal of proper software componentry.

> Also there is a lot more to imagine, TUI side. The problem is that most projects all copy what they already saw. For instance I just did this in 20 minutes.

This mockup is really nice and the sidebar display gives you a natural way to expose running multiple thinking flows in parallel, at least if you keep them from stepping on each other's toes with code edits (keep them all in read-only "plan" mode or working on completely separate directories/files). That's not so helpful on a 128GB MacBook where a single agentic flow brings you to thermal/power limits already, but it suddenly becomes useful on other hardware (DGX Spark, Strix Halo, lower-RAM machines with SSD offload, multiple nodes with pipeline parallelism) where you have more compute than you could use for single-stream decode.

neomantra 14 hours ago [-]
For Go, I highly recommend yzma to explore this surface. I’ve used it for embedding and summarization (with small models) and for just mucking around with an integrated LLM BubbleTea TUI idea (with bigger models).

https://github.com/hybridgroup/yzma

And thank you antirez for using your rep and quality output to push this line of evangelism; it is even more important than the software itself.

zozbot234 16 hours ago [-]
DS4 is an inference engine, not a harness. It provides an inference API server and you point your coding harness to it.
antirez 16 hours ago [-]
You misunderstood the OP. I hinted, in my blog, at my interest in also putting an agent harness inside.
0xbadcafebee 1 days ago [-]
I don't see an explanation of why they would make a model-specific inference engine vs just using llama.cpp. There are already lots of people working on the llama.cpp integration. This is a lot of effort spent on a single model, which is likely to become obsolete when a different model comes out that does better. In some discussions, people are now making PRs against both the llama.cpp branches and ds4... so it's taking a rare commodity (people investing development time in this model) and fragmenting it
dilap 21 hours ago [-]
way easier to work on a focussed c codebase you own than a mature unwieldy c++ codebase you don't. but it's fine, people will take that work and port it to llama.cpp and everyone wins.

(the ux of ds4 is fantastic too -- it's dead easy to get a known-good model and a great quant. with llama.cpp you're much more hacking in the wilderness, w/ many, many knobs.)

flakiness 1 days ago [-]
I believe the assumption is: The code is cheap. The collaboration (eg. upstreaming) is expensive.

Is it true? We'll see, in a few years.

zozbot234 1 days ago [-]
Author has mentioned many times that the llama.cpp maintainers don't want code that's prevalently written by AI with no human revision. If anyone wants to try and get the support upstreamed into that project, they're quite free to do that: the code is MIT licensed.
kristianp 1 days ago [-]
Also, antirez has been able to use GPT to iterate on the code and performance. He (with the others who contributed to DS4) has a set of result files to ensure that correctness is maintained, and benchmarks to verify performance, and the LLM is able to iterate within that framework. Having a small, focussed codebase helps here.

Antirez explained the dev process when he posted a pure C implementation of the Flux 2 Klein image gen model, at https://news.ycombinator.com/item?id=46670279

fgfarben 1 days ago [-]
At a certain point the level of abstraction / genericization necessary for a big flexible project (like llama.cpp or Linux) blows things up into a huge number of files. Something newer and smaller can move faster.
ljosifov 16 hours ago [-]
Love this, even if I can't use it atm (not got the h/w: only 96GB on an M2 Max). I get that the general public will find it unusable or worse. Reminds me of how home computers were mere toys before they became personal computers (PCs). On my h/w the only passable combo for me atm is the pi agent + llama.cpp + nemotron cascade-2 model: up to 1M context, and the hybrid arch doesn't crash & burn ~1/N^2 at the 10K-50K-100K context depths used by code agents. Was on a plane without Internet the other day. Brought a smile to my face that I could run the pi agent (with llama.cpp serving), and it was just about usable at 40-30 tok/s. Afaik the usual API speeds are double that, 60-80 tok/s. Sensors showed 60W when running inference, so the battery probably wouldn't last more than ~3h. The model being only 30B in size leaves plenty of space for KV caches and other programs, even at a generous 8-bit quant. Only 3B active params at a time (MoE A3B) is about the most the ageing M2 Max can carry, it seems.
embedding-shape 14 hours ago [-]
> even if can't use it atm (not got the h/w - only 96gb on M2 Max).

Not sure if it works differently on macOS, but with CUDA + DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf I can fit it within 96GB of VRAM, together with context, so theoretically I feel like you should too, unless macOS reserves GBs of RAM/VRAM for the OS/display by default.

ljosifov 11 hours ago [-]
On 96gb I can give up to about 88GB to the GPU with sysctl iogpu.wired_limit_mb=88000, without suffering any ill-effects. When pushed higher I tend to notice e.g. graphic driver errors, youtube web page not working, other semi-random glitches. So the ~80 GB of DS4-flash quants I could just about fit. Leaving some extra for the KV caches. Will try, I'm curious how's the DS4 degradation with context depth growth, how fast does tok/s drop. E.g. 2-bit lowest quant MiniMax-M2.6 runs, but starts low tok/s and degrades fast with context depth.

The biggest models I can comfortably run are about 1/2 the DS4F size - like gpt-oss-120b. Lately was toying with Ling-2.6-flash. Got the agents to adapt existing metal kernels in llama.cpp, and it did run (model https://huggingface.co/ljupco/Ling-2.6-flash-GGUF, branch https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flas...). It's 104B-A7B4, and for the M2 Max 7.4B active is about the most it can take while still producing 40 tok/s. And the hybrid arch allows for graceful degradation, still close to 30 tok/s at 64K context depth.

Too bad L2.6F, while the best I have, is not that much better in agentic benchmarks than my current incumbent local LLM (nemotron-cascade-2). Got inspired by DS4 to start a l26f branch (WIP https://github.com/ljubomirj/l26f). :-) Try to squeeze the most from L2.6F. There should be low-hanging fruit in good integration of the agent and the inference engine. On input: considering the huge difference between cached vs. non-cached tokens. On output: considering that the NN gives us the complete logits set for the whole 200K+ token vocabulary.

zozbot234 16 hours ago [-]
It should work with 96GB, especially on a limited context. But the M2 Max is a bit slower, yes.
antirez 12 hours ago [-]
It works on your computer I believe. There are a few positive reports.
ljosifov 5 hours ago [-]
Thanks for DS4, will give it a try. Was hoping maybe I could re-quantise and shave a few GB... MiniMax-M2.7 Unsloth's UD-IQ2_XXS is down to 65GB - it ran, albeit too slow to be usable by an agent at context depth. I'm curious whether DS4F being economical with the KV caches translates into keeping up with context. Was hoping the 80GB 2-bit quants might come down to 70GB... that would be more comfortable to run.
simonw 1 days ago [-]
I got this running on a 128GB M5 the other day - pretty painless, model runs in about 80GB of RAM and it seemed to be very capable at writing code and tool execution.
perfmode 1 days ago [-]
How’s the token throughput / response time?
simonw 1 days ago [-]
Healthy!

  prefill: 30.91 t/s, generation: 29.58 t/s
From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...
incidentist 9 hours ago [-]
Someone is working on a fork that is optimized for M5, might be worth a look: https://github.com/Swival/ds4-m5
antirez 11 hours ago [-]
Prefill is 400 t/s on that hardware. It's just that if the prompt is very short you can't see the real speed, and it will default to single-token context processing.
simonw 8 hours ago [-]
Hah, that's my fault for just using "Generate an SVG of a pelican riding a bicycle" as my test prompt!
embedding-shape 1 days ago [-]
Comparison with a RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:

prefill: 121.76 t/s, generation: 47.85 t/s

Main target seems to be Apple's Metal, so makes sense. Might be fun to see how fast one could make it go though :) The model seems really good too, even though it's in IQ2.

xienze 1 days ago [-]
I don't want to be a jerk, but 31 t/s prefill is basically unusable in an agentic situation. A mere 10k of context and you're sitting there for 5+ minutes before the first token is generated.
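The arithmetic, taking the quoted prefill rate at face value:

```python
# Time to first token = context length / prefill throughput.
context_tokens = 10_000
prefill_tps = 31   # the measured rate quoted above

ttft_s = context_tokens / prefill_tps
print(f"{ttft_s / 60:.1f} min to first token")  # 5.4 min
```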
fgfarben 1 days ago [-]
That prefill number isn't right. M4 Max hits 200-300: https://github.com/antirez/ds4/blob/main/speed-bench/m4_max_...
hadlock 22 hours ago [-]
M5 studio is gonna sell like hot cakes
throwdbaaway 19 hours ago [-]
Hah, that's because the prompt itself was only about 30 tokens. We need a much bigger prompt to properly test PP.
aiscoming 1 days ago [-]
if it's just the coding agent system prompt and tools, you can cache that
xienze 1 days ago [-]
Yeah the problem is that's just the start of the context. There's, you know, all the tool call results and file reads and stuff.
rtpg 24 hours ago [-]
what are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?
chatmasta 1 days ago [-]
So you’re saying I should buy the M5? :) I’ve been resisting, thinking I’ll never use it… it’ll be better in a year… I’ll wait for the Studio (do we still think that’s coming in June?)… etc.
simonw 23 hours ago [-]
I expect this to be my main machine for the next 3-4 years (which is how I justified the 128GB one). It's a beast of a machine - I love that I can run an 80GB model and still have 48GB left for everything else.

Can't say that it wouldn't be a better idea to spend that cash on tokens from the frontier hosted models though.

I'm an LLM nerd so running local models is worth it from a research perspective.

simpaticoder 21 hours ago [-]
An M5 Max MBP with 128G of RAM costs ~$5k. An Nvidia RTX 5090 with 32G RAM is $4-5k, and RTX PRO 6000 with 96GB RAM $10k. Do you have any data on which is the best price/performance for local inference? Do you know what the big OpenAI/Anthropic/Google datacenters are running?
driese 19 hours ago [-]
As always: it depends on your needs. Here's a very basic heuristics rundown:

- More RAM: bigger models, more intelligence.

- More FLOPs: higher pre-fill (reading large files and long prompts before answering, the so-called "time to first token").

- More RAM bandwidth: higher token generation (speed of output).

So basically Macs (high RAM, okay bandwidth, lowish FLOPs) can run pretty intelligent models at an okay output speed but will take a long time to reply if you give them a lot of context (like code bases). Consumer GPUs have great speed and pre-fill time, but low RAM, so you need multiple if you want to run large intelligent models. Big boy GPUs like the RTX 6000 have everything (which is why they are so expensive).

There are some more nuances like the difference of Metal vs. CUDA, caching, parallelization etc., but the things above should hold true generally.
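The heuristics above can be put into rough numbers. A hedged back-of-envelope sketch (all figures below are illustrative assumptions, not measurements of any real machine or model; `efficiency` is a guessed utilization factor):

```python
# Back-of-envelope estimates for local LLM inference speed.

def gen_tokens_per_sec(mem_bandwidth_gbs, active_params_b, bits_per_weight):
    """Decode is memory-bound: each token streams all active weights once."""
    active_bytes_gb = active_params_b * bits_per_weight / 8
    return mem_bandwidth_gbs / active_bytes_gb

def prefill_tokens_per_sec(flops_tflops, active_params_b, efficiency=0.5):
    """Prefill is compute-bound: roughly 2 FLOPs per active parameter per token."""
    return flops_tflops * 1e12 * efficiency / (2 * active_params_b * 1e9)

# Hypothetical 13B-active MoE at 2-bit on a Mac-like machine
# (~500 GB/s bandwidth, ~30 usable TFLOPs):
print(round(gen_tokens_per_sec(500, 13, 2)))   # decode estimate, t/s
print(round(prefill_tokens_per_sec(30, 13)))   # prefill estimate, t/s
```

This is why more bandwidth mostly buys output speed while more FLOPs mostly buy prefill: the two phases hit different walls.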

theturtletalks 14 hours ago [-]
Do you think Apple will fix prefill speed with the M6 Max MacBook Ultra 128GB?
jtbaker 11 hours ago [-]
It's already greatly improved over previous generations due to M5s having tensor cores (higher compute capacity for matmul operations, the bottleneck for prefill).
aiscoming 20 hours ago [-]
[dead]
somewhatrandom9 1 days ago [-]
With "intelligence" (or whatever you want to call it) and speed both seeming to ramp up quickly in local models, I wonder what the growth rate and ceiling might be in this space. Will this kind of IQ and performance work with just, e.g., 16GB of RAM in a couple of years? Is there a new kind of Moore's law to be defined here?
hadlock 22 hours ago [-]
640gb ought to be enough for anybody
famouswaffles 21 hours ago [-]
Squeezing a model like this, complete with 'big model smell', into 16GB... honestly, it's not feasibly possible today.

It'll require some kind of:

- breakthrough in architecture, or

- breakthrough in hardware, or

- breakthrough quantization technique

The problem is that all the parameters need to be in memory, even the ones that aren't active (say, for Mixture-of-Experts models), because swapping parameters in and out of RAM is far too slow.
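A quick back-of-envelope check of why (the 80B-total shape is taken from elsewhere in the thread; the ~10% overhead figure for KV cache and buffers is a guess):

```python
def model_ram_gb(total_params_b, bits_per_weight, overhead=1.1):
    """Weights for *all* experts must stay resident, not just the active ones;
    add ~10% for KV cache and runtime buffers."""
    return total_params_b * bits_per_weight / 8 * overhead

# A hypothetical 80B-total MoE: even at 2 bits per weight it needs ~22GB,
# so it cannot fit a 16GB machine no matter how few parameters are active.
for bits in (8, 4, 2):
    print(f"{bits}-bit: ~{model_ram_gb(80, bits):.0f} GB")
```

Note that the active-parameter count sets the speed, but the total-parameter count sets the memory floor.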

marci 19 hours ago [-]
"That’s where EMO comes in.

We show that EMO – a 1B-active, 14B-total-parameter (8-expert active, 128-expert total) MoE trained on 1 trillion tokens – supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of total experts) while retaining near full-model performance."

https://allenai.org/blog/emo

lwansbrough 24 hours ago [-]
The people working at the leading edge of this stuff seem to believe that there is a need for parallel models that solve different problems.

A crow exhibits some degree of intelligence in what is a very small brain compared to humans. There is overlap in the problem solving skills of the dumbest humans and the smartest crows.

So the question is: what is that? Yann LeCun seems to think it’s what we now call world models. World models predict behaviour as opposed to predicting structured data (like language.)

If your model can predict how some world works (how you define world largely depends on the size of your training data), then in theory it is able to reason about cause and effect.

If you can combine cause and effect reasoning with language, you might get something truly intelligent.

That’s where things seem to be going. Once we have a prototype of that system, there will be many questions about how much data you really need. We’ve seen how even shrinking LLMs with 1-bit quantization can lead to models that exhibit a fairly strong understanding of language.

I don’t think it’s unreasonable to expect to see some very intelligent low (relatively) memory AI systems in the next couple years.

ilaksh 19 hours ago [-]
I want something like this but not only for my own computer but also for client projects or stuff I might run in cloud GPUs. Because the core idea of having a strong model that is efficient and doesn't require a cluster still applies to a lot of business cases. I am hoping something like this can work in batch mode.

Right now I feel like a 4-bit Qwen 3.6 27B with MTP is one of the best for agentic tool calling in some smart voice agents on an H200. I wonder if DS4 Flash, being 80B at 2-bit with 13B active and MTP, could be even faster and smarter and allow more concurrent sequences?

This special 2bit quantization seems like a big deal.

wg0 14 hours ago [-]
DeepSeek V4 Pro is a really, really competent model, and what makes it extremely good is the price point it is offered at.

I have been toying with a 2.5D engine in C on top of raylib, using DeepSeek as a companion in between.

Its thinking transcripts in OpenCode are transparent, and it's mind-boggling to see the things it considers in its thought process. Very long to read, but none of it useless or meaningless.

It has often happened that I discover an assumption I hadn't thought about, or one that was wrong, which DeepSeek flags in its thought process; then in the final output it "aligns" to my flawed request, and I'll tell it: wait, I saw you thought so-and-so too, and that's correct, I made a mistake, let's consider that aspect as well.

minimaxir 1 days ago [-]
A relevant recent tweet from antirez: https://x.com/antirez/status/2054854124848415211

> Gentle reminder on how, in the recent DS4 fiesta, not just me but every other contributor found GPT 5.5 able to help immensely and Opus completely useless.

I've noticed the same for lower level squeezing-as-much-performance-as-possible code work.

throwaway041207 1 days ago [-]
Assuming we are talking about Code/Codex are you on API billing or subscription? I have essentially unlimited API billing at my disposal and I haven't noticed any degradation of quality across Opus versions.
chatmasta 1 days ago [-]
Same here, the enterprise version of Claude has been great. Luckily I’m not the one paying for it. We also have CoPilot and when GPT-5.4 came out, and was 1x request cost, I was very impressed but haven’t had much time to compare the two.

I also don’t have time to do much personal coding outside of work, so I haven’t subscribed to a personal one yet. But I intend to go for Codex just to balance the Claude at work and also because of the hostile moves from Anthropic toward their consumer business.

rjh29 17 hours ago [-]
There's so much subjectivity with models. As soon as a new model comes out people act like the last model they used for 6 months was completely useless.
sanxiyn 1 days ago [-]
There is a benchmark for performance work, and I think it is not being optimized for by model vendors. The latest result from GSO is that both Opus 4.6 and 4.7 slightly outperform GPT 5.5. This also matches my experience.

https://gso-bench.github.io/

vitorsr 24 hours ago [-]
Tasks are taken from commit histories in public Git repositories which defeats the purpose.
easythrees 24 hours ago [-]
I thought for a moment there was a Dark Souls 4
NDlurker 24 hours ago [-]
I was thinking dual shock 4
tuveson 9 hours ago [-]
I thought Future put out a new album.
JavierFlores09 24 hours ago [-]
Glad I wasn't the only one, my second thought was Dual Shock controller but that wasn't it either lol
blitzar 18 hours ago [-]
The prequel to the prequel of Deep Space 9
txhwind 19 hours ago [-]
Fucking abbreviations. Who knows it's DeepSeek, Dark Souls or DualShock? All possible on HN.
the__alchemist 11 hours ago [-]
Could be Death Stranding too
albertzeyer 15 hours ago [-]
More information about DwarfStar 4 (DS4) in the readme: https://github.com/antirez/ds4

The code seems based on llama.cpp and GGML.

I don't fully understand why it is a standalone project. The readme discusses this: DwarfStar 4 is a small native inference engine specific for DeepSeek V4 Flash. It is intentionally narrow: ...

I think the only bigger difference in DeepSeek V4 vs other models is maybe the type of self-attention. And that leads to: KV cache is actually a first-class disk citizen.

But I still feel like those changes could have been implemented as part of some of the other local engines.

I also assume more models will come out, not just from DeepSeek but also from others, that share similar self-attention approaches and would benefit from a similar KV cache implementation.
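For illustration, the general "KV cache as a first-class disk citizen" idea can be sketched with a memory-mapped file. This is a toy sketch of the concept, not DS4's actual implementation; the shapes and file layout here are made up:

```python
import os
import tempfile
import numpy as np

n_layers, max_tokens, kv_dim = 4, 1024, 64  # toy dimensions
path = os.path.join(tempfile.mkdtemp(), "kv_cache.npy")

# Memory-map the cache so the OS pages entries in and out on demand,
# and so a session's context can survive process restarts.
cache = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float16,
    shape=(n_layers, max_tokens, 2, kv_dim))

# Write the key/value vectors for one new token at position 0, layer 0.
cache[0, 0, 0] = 1.0  # key
cache[0, 0, 1] = 2.0  # value
cache.flush()

# A later session can reopen the same file read-only and resume
# without re-running prefill over the old context.
resumed = np.load(path, mmap_mode="r")
```

The win is that a long context doesn't have to be recomputed (or even fully held in RAM) to be reused.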

antirez 11 hours ago [-]
Check the readme more carefully. The code overlap with ggml is very small, but a few kernels, some ideas, and the quants code were taken. Still, the project's connection with llama.cpp and ggml is huge, and it's also present in the license, because it's not a matter of code but of a whole ecosystem built: engineering lessons on how to do things, and much more. The readme also explains exactly why a vertical inference system for a single model is the goal of the project.
skiwithuge 14 hours ago [-]
Because llama.cpp doesn't accept PRs made entirely by AI agents, even if they are guided by the author:

https://github.com/ggml-org/llama.cpp/blob/master/AGENTS.md

embedding-shape 14 hours ago [-]
Which makes sense: the number of PRs llama.cpp receives from authors who have no clue what they're doing and can't even answer simple questions about what they did is staggering. It must be very exhausting to have to figure out "is it worth replying to this author?" for every single PR.
ttoinou 14 hours ago [-]
When I ran DS4 Q2 the other day (without the new Q2 imatrix update), it behaved quite poorly after a few agentic turns with opencode: it couldn't modify files, yet kept telling me the work was ready without using any tool to update them.
antirez 11 hours ago [-]
The bugs were in the API tool call handling; the model worked well. I would retest with updated code and GGUF. I, and many others, never saw it miss anything obvious: reliable tool calls and reasoning. The project is a few days old, so certain agent/API combinations definitely had DSML-related issues.
ttoinou 10 hours ago [-]
I retested and it’s much better now. Wow !
kamranjon 1 days ago [-]
Just want to mention that I've been pulling down and using DwarfStar locally and it's incredible. I actually have it running on my personal macbook m4 max with 128gb of ram and I am running the server to share it through tailscale with my work laptop and just have pi running there.

The long context reasoning is something I haven't even seen in frontier models - I was running at 124k tokens earlier and it was still just buzzing along with no issues or fatigue.

I am amazed at how well it works. I'm using it right now for some pretty complex frontend work, and it is much, much faster for me than running a dense 27B or 31B model (like Qwen or Gemma) - the benefits of MoE - but the long-context capabilities are what have been absolutely flooring me.

Super excited about this project and hope antirez can keep himself from burning out - I've been following the repo pretty closely, and there are a ton of PRs flooding in; it seems like he's had to do a lot of filtering out of slop code.

le-mark 1 days ago [-]
Is DS4 dwarf star 4 or deep seek 4?
kamranjon 1 days ago [-]
Just updated! Sorry I meant Dwarf Star - it's the only way I've actually managed to run DeepSeek flash on my local hardware
zackify 1 days ago [-]
Are you on q2?
kamranjon 23 hours ago [-]
Yea I'm on the imatrix q2 version now
wolttam 1 days ago [-]
DwarfStar 4 is DeepSeek 4 (check the repo)
whazor 15 hours ago [-]
Some of my colleagues believe that current frontier AIs are too heavily subsidized and that it will come to an end: frontier coding AIs might become unavailable for one reason or another. But projects like this show that with a $6,000 MacBook we are getting closer to a local frontier model. More importantly, it shows the genie will not go back into the bottle.
NitpickLawyer 15 hours ago [-]
> This project supports steering with single-vector activation directions; [...] This is also useful for cybersecurity researchers who want to reduce a model's willingness to provide dual-use or offensive security guidance.

Wink wink, nudge nudge.

I have a feeling most cybersec researchers would only be interested in negative values of "reduce" :D

Riany 20 hours ago [-]
I think local models just need to be good enough that privacy, latency, and control become worth the tradeoff, rather than having to beat the best cloud models.
bjconlan 1 days ago [-]
This is great! I feel the same way about the deepseek v4 architecture for commodity hardware.

Also have enjoyed playing with https://huggingface.co/HuggingFaceTB/nanowhale-100m-base (but early days for me understanding this space)

kamranjon 1 days ago [-]
Very cool! I had no idea that HF was doing this - I really love their small model experiments.
kgeist 22 hours ago [-]
Has anyone compared DeepSeek 4 Flash to Qwen3.6-27B on real tasks (quality + speed)? According to the benchmarks at artificialanalysis.ai, Qwen3.6-27B is better at agentic tasks, and DS4 is only 2 points better at coding (both with max reasoning effort, full weights). At the same time, DS4 requires 5 times more VRAM even at 2 bits. Last time I explored this topic, large MoE models at 2-3 bits usually performed worse (quality-wise) than dense ~30B models at 4-8 bits, despite being much heavier to run.

Sure, MoE models have more knowledge, but extreme quantization may negate the benefits. And generally for coding tasks, you don't need a model that has memorized all the irrelevant trivia like, I don't know, the list of all villages in country X. DS4 also seems to run much slower on Mac Studio Ultra, which appears to be more or less in the same price range as RTX 5090. RTX 5090 gives me 50-60 tok/sec and 260k context with Unsloth's 5-bit quantization (only some layers are 5-bit too) and an 8-bit KV cache; prefill is instant too. It works flawlessly in OpenCode.

If you already have a spare high-end Mac, I can see the benefit, but I'm not sure it's a good configuration overall. Unless Qwen3.6 is more benchmaxxed than DS4 :)

muyuu 12 hours ago [-]
for unified memory, the dense models are way too slow and for local GPU-based setups, large MoE are too large but they're fine on unified memory systems

essentially, hardware is the main reason you may choose one or the other locally

i have a Strix Halo system so I will be trying this Dwarf Star 4 thingie eventually when i have some free time

sbinnee 1 days ago [-]
It is a big thing for sure to have a competitive local agentic model. I've replaced Gemini 3 Flash Preview with DeepSeek V4 Flash for all of my personal use cases: chat app, language learning, and even hobby coding. For coding, I couldn't get decent results before, no matter which latest SOTA models I used. It's not close to Opus or Codex models; it's a flash model and makes mistakes here and there (I just saw `from opentele while import trace`, new Python syntax!)

But I found its tool calling more reliable than other OSS models I've tried. I attribute that to interleaved thinking. Its reasoning effort is adjusted automatically per query. I enjoy reading the reasoning traces from open models, because you can't see them with proprietary models.

I would love to try DS4 so bad. Well, I don't have a machine for it. I will just stick to openrouter. I wish I can run a competitive oss model on 32GB machine in 3 years.

zozbot234 1 days ago [-]
> I wish I can run a competitive oss model on 32GB machine in 3 years.

You could try DS4 on that machine anyway and see how gracefully it degrades (assuming that it runs and doesn't just OOM immediately). Experimenting with 36GB/48GB/64GB would also be nice; they might be able to gain some compute throughput back by batching multiple sessions together (though obviously at the expense of speed for any single session).

thegeomaster 1 days ago [-]
> `from opentele while import trace`

FYI, this to me points to an inference bug, bad sampling, or a non-native quant. OpenRouter is known to route requests to absolutely terrible, borked implementations. A model like DeepSeek V4 Flash shouldn't be making syntax errors like this.

kristianp 1 days ago [-]
> I wish I can run a competitive oss model on 32GB machine in 3 years.

It's so hard to predict what size the open-weight models will be, even in 6 months time. Will a 96GB machine turn out to be a complete waste of money? Who knows.

DANmode 10 hours ago [-]
Why would it be?

Today’s models, today’s usefulness doesn’t disappear tomorrow.

sourcecodeplz 17 hours ago [-]
This project is a week old and already super popular. I guess people really were tired of LM Studio or tuning llama.cpp settings.
zargon 15 hours ago [-]
llama.cpp (and consequently LM Studio) don't support DeepSeek V4. If you want to run V4, this is your only option right now unless you have hardware that can run vLLM.
karel-3d 16 hours ago [-]
Oh a local DeepSeek? Nice

> Starting from MacBooks with 96GB of RAM.

... oh. And I thought I bought a lot with 48 GB.

zozbot234 16 hours ago [-]
96GB is what the author claims will work in a foolproof way for easy production use. But nothing stops you from trying to run it on 48GB, it ought to gracefully fall back on accessing model layers from the disk.
brcmthrowaway 1 days ago [-]
This guy is falling deep into Yegge-tier psychosis.
linkregister 1 days ago [-]
Empirically, DS4 is hosting the DeepSeek v4 Flash model with good performance on home hardware. I'm curious how you came to this conclusion.
dakolli 1 days ago [-]
"Empirically", have you tested this yourself?
linkregister 23 hours ago [-]
It's trivial to find reviews and benchmarks of DS4 online. Also, there are benchmarks in the article.

Here's one of the top hits: https://forums.developer.nvidia.com/t/fully-custom-cuda-nati...

Bizarre comment; sounds like "How do you know Porsches are fast? Did you drive one?"

calmingsolitude 22 hours ago [-]
Parent is simply pointing out the incorrect usage of "empirically", which should typically only be mentioned when you've tested it yourself.
linkregister 21 hours ago [-]
I'm having trouble finding dictionaries or other references that add the qualifier that it needs to be self-tested and not relying on the research of others. Can you point me to one?
dakolli 21 hours ago [-]
I don't think comments on the internet count as "empirical" evidence, but sure.
linkregister 20 hours ago [-]
If you think antirez's benchmarks in the blog post are false, you should make the claim. Continue to move the goal posts.
dakolli 22 hours ago [-]
Are you comparing an LLM running on a laptop to a Porsche?

I just find it really funny people are willing to write things like "empirically speaking, X is obvious" without actually testing it themselves.

I've seen mixed reviews, and the most honest sounding ones have said it has latency issues.

I don't really care that much what the average LLM power user says at this point; they're impressed by anything an LLM does. They're like toddlers entertained by the sound their Velcro shoes make.

You LLM people are going to be like my mom: once she got a Maps app she completely gave up on navigating anywhere with her own brain, and is lost without a phone.

Except for you LLM people, it's going to be reading, writing, problem solving, and thinking in general. You'll be completely reliant on an LLM to get anything done. Have fun with that. You're cooked, bro.

linkregister 21 hours ago [-]
It's funny because you make these assertions without any empiricism of your own. They're just speculations.

"You LLM people". Has it occurred to you that individuals have variation within groups?

wren6991 21 hours ago [-]
Not even close. "I made this DSP task faster by focusing on exactly one compute graph on one machine instead of a compute graph compiler that runs on every possible machine" is a real engineering approach, and the AI usage is incidental. Things like Gas Town are self-serving turboslop whose only purpose is to generate more slop.
fgfarben 1 days ago [-]
Nope.
vrighter 20 hours ago [-]
Damn it, I was expecting something interesting about the PS4 controller, not some more junk about AI. Such a rugpull.
codedokode 1 days ago [-]
I thought DeepSeek was closed-weights and proprietary? I wonder how it compares against Western open-weight models. The hugging face page contains the comparison only with proprietary models for some reason.
itishappy 1 days ago [-]
DeepSeek has always been open-weight, and the DeepSeek HuggingFace page does not contain any comparisons. Where did you form these opinions?
codedokode 1 days ago [-]
itishappy 1 days ago [-]
Just the first one then...

Apologies. Where did I form my opinions?

zozbot234 1 days ago [-]
Nemotron would be a comparable Western open model AIUI.