I am! I moved from a shoebox Linux workstation with 32GB of RAM and a 12GB RTX 3060 to a 256GB M3 Ultra, mainly for the unified memory.
I've only had it a couple of months, but so far it's proving its worth in the quality of LLM output, even quantized.
I generally run Qwen3-VL at 235b, at a Q4_K_M quantization level so that it fits. That leaves me plenty of RAM for workstation tasks while delivering around 30 tok/s.
The smaller Qwen3 models (like qwen3-coder) I use in tandem; they run much faster, of course, and I tend to run them at higher quants (up to Q8) for quality.
The gigantic RAM's biggest boon, I've found, is letting me run the models with full context allocated, which lets me hand them larger and more complicated things than I could before. This alone makes the money I spent worth it, IMO.
I did manage to get glm-4.7 (a 358b model) running at a Q3 quantization level. Its output is adequate quality-wise, though it only delivers around 15 tok/s, and I had to cut context down to 128k to leave enough room for the desktop.
If you get something this big, it's a powerhouse, but not nearly as much of a powerhouse as a dedicated Nvidia GPU rig. The point is to be able to run them _adequately_, not at production speeds, to get your work done. I found the price/performance/energy usage compelling at this level and I am very satisfied.
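For a concrete picture of the setup described above, here's a minimal llama-cpp-python sketch; the GGUF filename, context size, and prompt are placeholders rather than my exact configuration:

    # Sketch: load a large quantized GGUF with a big context window on Apple Silicon.
    # The model path and n_ctx are hypothetical placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/Qwen3-VL-235B-A22B-Q4_K_M.gguf",  # placeholder filename
        n_ctx=131072,      # the unified memory is what makes a context this large feasible
        n_gpu_layers=-1,   # offload all layers to the Metal backend
        verbose=False,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Review this module for concurrency bugs: ..."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])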
ryan-c 1 day ago
I'm using an M3 Ultra w/ 512GB of RAM, using LMStudio and mostly mlx models. It runs massive models with reasonable tokens per second, though prompt processing can be slow. It handles long conversations fine so long as the KV cache hits. It's usable with opencode and crush, though my main motivation for getting it was specifically to be able to process personal data (e.g. emails) privately, and to experiment freely with abliterated models for security research. Also, I appreciate being able to run it off solar power.
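For anyone curious what the mlx side looks like outside LM Studio, it's only a couple of calls with the mlx-lm library; the model repo below is just an example 4-bit conversion, not necessarily what I run:

    # Sketch: run an MLX-converted model directly with mlx-lm.
    # The repo name is an example placeholder.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")  # example repo
    prompt = "Summarize the attached email thread: ..."

    # verbose=True reports prompt tok/s and generation tok/s separately,
    # which makes the prefill-vs-decode gap easy to see.
    text = generate(model, tokenizer, prompt=prompt, max_tokens=300, verbose=True)
    print(text)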
I'm still trying to figure out a good solution for fast external storage; I only went for 1TB internal, which doesn't go very far with models that have hundreds of billions of parameters.
Acasis makes 40Gbps external NVMe cases. Mine feels quick (for non-LLM tasks).
I also use 10gbps Terramaster 4-bay RAIDs (how I finally retired my Pro5,1).
>energy usage
This thing uses an order of magnitude -less- energy than the computer it replaced, and is faster in almost every aspect.
gneuron 15 hours ago
This is the way brother
StevenNunez 2 days ago
I do! I have an M3 Ultra with 512GB. A couple of opencode sessions running work well. Currently running GLM 4.7 but was on Kimi K2.5. Both great. Excited for more efficiencies to make their way to LLMs in general.
circularfoyers 1 day ago
The prompt processing times I've heard about have put me off going that high on memory with the M series (hoping that changes with the M5 series, though). What are the average and longest times you've had to wait when using opencode? Have any improvements to mlx helped in that regard?
jtbaker 6 hours ago
The M5 Ultra series is supposed to have some big gains in prompt processing - something like 3-4x from what I've read. I'm tempted to swap out the M4 mini I'm using for this kind of stuff right now!
pcf 18 hours ago
Wow, Kimi K2.5 runs on a single M3 Ultra with 512 GB RAM?
Can you share more info about quants or whatever is relevant? That's super interesting, since it's such a capable model.
satvikpendem 1 day ago
How's the inference speed? What was the price? I'm guessing you can fit the entire model without quantization?
UmYeahNo 1 day ago
Excellent. Thanks for the info!
pcf 18 hours ago
Below are my test results after running local LLMs on two machines.
I'm using LM Studio now for ease of use and simple logging/viewing of previous conversations. Later I'm gonna use my own custom local LLM system on the Mac Studio, probably orchestrated by LangChain and running models with llama.cpp.
My goal has always been to use them in ensembles in order to reduce model biases. The same principle has just now been introduced as a feature called "model council" in Perplexity Max: https://www.perplexity.ai/hub/blog/introducing-model-council
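As a sketch of what I mean by an ensemble: ask the same question to a few locally served models and have one of them act as a judge. The ports and model names below are placeholders for whatever OpenAI-compatible endpoints you happen to run (LM Studio and llama-server both expose one):

    # Sketch: query several local models, then let one judge the answers.
    # Endpoints and model names are placeholders.
    from openai import OpenAI

    members = [
        ("gemma-3-27b", OpenAI(base_url="http://localhost:1234/v1", api_key="local")),
        ("gpt-oss-120b", OpenAI(base_url="http://localhost:8080/v1", api_key="local")),
    ]

    question = "What are the trade-offs of running LLMs on unified memory?"

    answers = []
    for name, client in members:
        resp = client.chat.completions.create(
            model=name,
            messages=[{"role": "user", "content": question}],
        )
        answers.append(f"{name}: {resp.choices[0].message.content}")

    judge_name, judge = members[0]
    verdict = judge.chat.completions.create(
        model=judge_name,
        messages=[{
            "role": "user",
            "content": "Compare these answers, note where they disagree, and give a "
                       "consensus answer:\n\n" + "\n\n".join(answers),
        }],
    )
    print(verdict.choices[0].message.content)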
Chats will be stored in and recalled from a PostgreSQL database with extensions for vectors (pgvector) and graph (Apache AGE).
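The pgvector half of that is small. Here's a minimal sketch assuming a local Postgres with the vector extension available, psycopg, and 768-dimensional embeddings from some local embedding model (all assumptions on my part; the Apache AGE graph side is left out):

    # Sketch: store chat turns with embeddings in Postgres/pgvector and recall
    # the most similar earlier messages. DSN, table, and dimensions are placeholders.
    import psycopg

    def to_vec_literal(vec):
        # pgvector accepts a '[v1,v2,...]' text literal cast to ::vector
        return "[" + ",".join(str(x) for x in vec) + "]"

    with psycopg.connect("dbname=chats") as conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS chat_messages (
                id        bigserial PRIMARY KEY,
                role      text NOT NULL,
                content   text NOT NULL,
                embedding vector(768)   -- must match the embedding model's output size
            )
        """)

        embedding = [0.0] * 768  # stand-in; comes from a local embedding model in practice
        cur.execute(
            "INSERT INTO chat_messages (role, content, embedding) VALUES (%s, %s, %s::vector)",
            ("user", "How do I resize a btrfs volume?", to_vec_literal(embedding)),
        )

        # recall: nearest neighbours by cosine distance
        cur.execute(
            "SELECT content FROM chat_messages ORDER BY embedding <=> %s::vector LIMIT 5",
            (to_vec_literal(embedding),),
        )
        for (content,) in cur.fetchall():
            print(content)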
For both sets of tests below, MLX was used when available, but ultimately ran at almost the same speed as GGUF.
I hope this information helps someone!
/////////
Mac Studio M3 Ultra (default w/96 GB RAM, 1 TB SSD, 28C CPU, 60C GPU):
• Gemma 3 27B (Q4_K_M): ~30 tok/s, TTFT ~0.52 s
• GPT-OSS 20B: ~150 tok/s
• GPT-OSS 120B: ~23 tok/s, TTFT ~2.3 s
• Qwen3 14B (Q6_K): ~47 tok/s, TTFT ~0.35 s
(GPT-OSS quants and 20B TTFT info not available anymore)
//////////
MacBook Pro M1 Max 16.2" (64 GB RAM, 2 TB SSD, 10C CPU, 32C GPU):
• Gemma 3 1B (Q4_K): ~85.7 tok/s, TTFT ~0.39 s
• Gemma 3 27B (Q8_0): ~7.5 tok/s, TTFT ~3.11 s
• GPT-OSS 20B (8bit): ~38.4 tok/s, TTFT ~21.15 s
• LFM2 1.2B: ~119.9 tok/s, TTFT ~0.57 s
• LFM2 2.6B (Q6_K): ~69.3 tok/s, TTFT ~0.14 s
• Olmo 3 32B Think: ~11.0 tok/s, TTFT ~22.12 s
TomMasz 23 hours ago
I've got an M2 Ultra with 64 GB, and I've been using gpt-oss-20b lately with good results. Performance and RAM usage have been reasonable for what I've been doing. I've been thinking of trying the newer Qwen 3 Coder Next just to see what it's like, though.
satvikpendem 2 days ago
There are some people on r/LocalLlama using it [0]. Seems like the consensus is that while it does have more unified RAM for running models (up to half a terabyte), token generation can be slow enough that it might just be better to get an Nvidia or AMD machine.
[0] https://old.reddit.com/r/LocalLLaMA/search?q=mac+studio&rest...
I have a maxed out M3 Ultra. It runs quantized large open Chinese models pretty well. It's slow-ish, but since I don't use them very frequently, most of the time is spent waiting for the model to load from disk into RAM.
There are benchmarks on token generation speed out there for some of the large models. You can probably guess the speed for models you're interested in by comparing the sizes (mostly look at the active params).
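A rough rule of thumb for that guessing (treat the numbers as approximations, not a benchmark): decoding is mostly memory-bandwidth bound, so tokens per second is capped at roughly memory bandwidth divided by the bytes of active weights read per token.

    # Back-of-envelope decode-speed ceiling. Numbers are approximate assumptions:
    # M3 Ultra memory bandwidth ~800 GB/s, Q4-ish quant ~0.6 bytes per weight.
    bandwidth_gb_s = 800
    active_params_b = 22        # e.g. a 235B MoE with ~22B active params per token
    bytes_per_weight = 0.6      # roughly Q4_K_M including overhead

    gb_read_per_token = active_params_b * bytes_per_weight    # ~13 GB
    ceiling_tok_s = bandwidth_gb_s / gb_read_per_token         # ~60 tok/s upper bound
    print(f"theoretical ceiling: ~{ceiling_tok_s:.0f} tok/s")
    # Real-world numbers land well under the ceiling (KV cache reads, scheduling,
    # non-ideal kernels), which is consistent with the ~30 tok/s reported upthread.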
Currently the main issue for M1-M4 is the prompt "preprocessing" speed. In practical terms, if you have a very long prompt, it's going to take a long time to process. IIRC it's due to a lack of efficient matrix multiplication operations in the hardware, which I hear is rectified in the M5 architecture. So if you need to process long prompts, don't count on the Mac Studio, at least not with large models.
So in short, if your prompts are relatively short (e.g. a couple thousand tokens at most), you need/want a large model, you don't need too much scale/speed, and you need to run inference locally, then Macs are a reasonable option.
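If you want to see where your own prompts land, you can time prefill and decode separately by streaming from whatever local server you run; the endpoint and model name below are placeholders for any OpenAI-compatible server (llama-server, LM Studio, etc.):

    # Sketch: time prompt processing (time to first token) separately from decode speed.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # placeholder
    long_prompt = "Here is the document to analyse:\n" + ("lorem ipsum " * 2000)

    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model="local-model",  # placeholder identifier
        messages=[{"role": "user", "content": long_prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1

    if first_token_at is None:
        raise SystemExit("no tokens returned")
    total = time.perf_counter() - start
    ttft = first_token_at - start
    print(f"TTFT (dominated by prompt processing): {ttft:.1f}s")
    print(f"decode: ~{n_chunks / (total - ttft):.1f} chunks/s (roughly tok/s)")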
For me personally, I got my M3 Ultra somewhat due to geopolitical issues. I'm barred from accessing some of the SOTA models from the US due to where I live, and sometimes the Chinese models are not conveniently accessible either. With the hardware, they can pry DeepSeek R1, Kimi-K2, etc. from my cold dead hands lol.
runjake 20 hours ago
For anything other than a toy, I would recommend at least a Max processor and at least 32 GB memory, depending on what you're doing. I do a lot of text, audio, and NLP stuff, so I'm running smaller models and my 36GB is plenty.
Ultra processors are priced high enough that I'd be asking myself whether I'm serious about local LLM work and doing a cost analysis.
giancarlostoro 1 day ago
Not a Mac Studio, but I use a basic MacBook Pro laptop with 24 GB of RAM (16 usable as VRAM) and I can run a number of models on it at decent speed. My main bottleneck is context window size, but if I'm asking single-purpose questions I'm fine.
UmYeahNo 1 day ago
Yeah. I'm currently on a Mac Mini M2 Pro with 32GB of RAM, and I was so curious how much more I could get out of the Apple ecosystem. Thanks for your perspective.
StrangeSound 1 day ago
What models are you running?
giancarlostoro 9 hours ago
The most I've run was a GPT 20B model. I also run SDXL, which runs rather quickly via the Draw Things app; there's an 8-step LoRA that lets you generate images in just 8 steps.
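Draw Things handles that in-app; for anyone who prefers scripting it, a rough diffusers equivalent on the Metal backend looks something like this (the LoRA repo and weight filename are from memory, so check the actual model card before trusting them):

    # Sketch: 8-step Lightning-style LoRA with diffusers on the mps backend.
    # Repo names and the LoRA filename are assumptions, not verified here.
    import torch
    from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("mps")

    pipe.load_lora_weights(
        "ByteDance/SDXL-Lightning", weight_name="sdxl_lightning_8step_lora.safetensors"
    )
    pipe.fuse_lora()
    # Lightning-style LoRAs expect trailing timestep spacing and no CFG
    pipe.scheduler = EulerDiscreteScheduler.from_config(
        pipe.scheduler.config, timestep_spacing="trailing"
    )

    image = pipe("a watercolor fox, studio lighting",
                 num_inference_steps=8, guidance_scale=0).images[0]
    image.save("fox.png")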
rlupi 1 day ago
I have an M3 Ultra with 96 GB; it works reasonably well with something like qwen/qwen3-vl-30b (fast), openai/gpt-oss-120b (slow-ish), or openai/gpt-oss-20b (fast, largest context). I keep the latter loaded and have a cronjob that generates a new MOTD for my shell every 15 minutes with information gathered from various sources.
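The MOTD job is nothing fancy, roughly the sketch below pointed at whatever local server keeps the model resident; the endpoint, model identifier, and output path are placeholders rather than my exact setup:

    # Sketch: regenerate a shell MOTD from a locally served model.
    # Crontab entry (hypothetical path): */15 * * * * /usr/bin/python3 ~/bin/make_motd.py
    import datetime
    import pathlib
    import subprocess

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="local")  # placeholder

    uptime = subprocess.run(["uptime"], capture_output=True, text=True).stdout.strip()
    prompt = (
        f"It is {datetime.datetime.now():%A %H:%M}. System status: {uptime}. "
        "Write a two-line message of the day: a dry status summary plus one terse tip."
    )

    resp = client.chat.completions.create(
        model="gpt-oss-20b",  # placeholder; whatever model is kept resident
        messages=[{"role": "user", "content": prompt}],
        max_tokens=120,
    )
    # the shell just cats this file at login
    pathlib.Path.home().joinpath(".motd").write_text(resp.choices[0].message.content + "\n")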
caterama 20 hours ago
M3 Ultra with 256 GB memory, using GPT-OSS 120B in Ollama. It's decently fast, but it makes the system somewhat unstable. I have to reboot frequently, otherwise the GPU seems to flake (e.g. visual artifacts/glitches in other programs).
stoneforger 23 hours ago
M4 Pro Mac mini, 24GB, running qwen3-8b-mlx and others. Speed is fine; the problem is the context window. In theory CoreML would be better from an efficiency perspective, but I think it's non-trivial to run models with CoreML (could be wrong).
My experience with Mac Studio is that memory bandwidth matters more than raw cores for reasonable LLM throughput locally; curious what others find for models >13B parameters?
mannyv 1 day ago
Mine is an M1 Ultra with 128GB of RAM. It's fast enough for me.
UmYeahNo 1 day ago
Thanks for the perspective!
callbacked 17 hours ago
I can only speak for myself here, but the prompt processing speeds on Apple Silicon are too slow, especially for any meaningful usage.
Adanos 1 day ago
Nope, my MacBook Pro is enough for now.