We're still adding samples, but some early takeaways from benchmarking on https://gertlabs.com:
Contrary to the model card, its one-shot performance is more impressive than its agentic abilities. On both metrics, GLM 5.1 is competitive with frontier models.
But keeping in mind this is an open source model operating near the frontier, it's nothing short of incredible.
I suspect 2 issues with the model are keeping it from fully realizing its potential in agentic harnesses:
- Context rot (already a common complaint). We are still working on a metric to robustly test and visualize this on the site.
- The model was most likely overtrained on standardized toolsets and benchmarks, and isn't as adaptive in using arbitrary tooling in our custom harness simulations. We've decided to commit to measuring intelligence as the ability to use custom, changing tools, instead of being trained to use specific tools (while still always providing a way to run local bash and other common tools). There are arguments to be made for either, but the former is more indicative of general intelligence. Regardless, it's a subtle difference and GLM 5.1 still performs well with tooling in our environments.
Crazy week for open source AI. Gemma 4 has shown that large model density is nowhere near optimized. Moats are shrinking.
If there are more representations of model performance you'd like to see, I'm actively reading your feedback and ideas.
DeathArrow 16 hours ago [-]
It would be nice if you can test the model with different harnesses, Z.ai's own Z Code, Claude Code, Open Code, Pi, Cursor etc.
My impression is that the choice of harness matters a lot.
gertlabs 15 hours ago [-]
Interesting idea. The metric I'd intuitively want to see is low variance between harnesses for a smarter model. But if a large sample of models statistically outperformed with a certain harness, that's indeed a valuable signal for a developer.
nareyko 16 hours ago [-]
[dead]
IceHegel 14 hours ago [-]
[dead]
Ms-J 16 hours ago [-]
Z.ai and their GLM models are pretty low quality.
I've been testing it for awhile now since it seemed to have potential as a local model.
With this new update it still cannot parse simple, test PDFs correctly. It inconsistently tells me that the value in the name field in the document is incorrect, and has the name reversed to put the last name first. Or that a date is wrong as it's in the past/future, when it is not. Tons of fundamental errors like that.
Even when looking at the thinking process there are issues:
I used a test website for it to analyze and it says that the sites copyright year states 2026 which is in the future and to investigate as it could be an attack, but right after prints today's correct date.
I'm in the process of trying to get it uncensored. Hopefully that will create some use out of z.ai
Edit: by the way, which is the best uncensored model at the moment?
rednb 15 hours ago [-]
I'e been using their models pretty much daily for the past 2 months to work on the codebase of a very complex B2B2C platform written in an unusual functional language (F#) with an angular frontend.
I also use Claude premium daily for another client, and i use Codex. and i can tell you that GLM5 is at this point much more capable than Claude and Codex for complex backend end work, complex feature planning, and long horizon tasks. One thing i've noticed is that it is particularly good at following instructions and guidelines, even deep into the execution of a plan.
To me the only problem is that z.ai have had trouble with inference : the performance of their API has been pretty poor at times. It looks like this is an hardware issue related to the Huawei chips they use rather than an issue with the model itself. The situation has been substantially improving over the past few weeks.
GLM5.1, GLM5-Turbo and GLM5v are at this point better than Opus, Codex, Gemini and other claude source models. We have reached a major turning point. To me, the only closed source model still in the game is codex as it is much faster at executing simple tasks and implementing already created plans.
Try GLM5v for your PDF work, it's their last generation vision model that has been released a couple of days ago.
0x008 14 hours ago [-]
Does anyone have inside info on what these Huawai chips look like? I know Google has a Torus architecture unlike Nvidias fully connected one. Maybe it’s a similar architectural decision on the huawai chips that leads to bottlenecks in serving?
>For AI computing, the Atlas 950 SuperPoD, powered by UnifiedBus, integrates 64 NPUs per cabinet and can scale up to 8,192 NPUs, delivering superior performance for large-scale AI training and high-concurrency inference.
blazarquasar 13 hours ago [-]
Plenty of other providers that offer much faster inference on GLM-5.1. Friendli, GMICloud, Venice, Fireworks, etc. And can be deployed through Bedrock already as well. Will probably be available generally in Bedrock soon, I would guess.
electroglyph 12 hours ago [-]
better than Opus? not even close. after struggling thru server overload for the past couple hours i finally put 5.1 thru the paces and it's....okay. failed some simple stuff that Sonnet/Opus/Gemini didn't. failed it badly and repeatedly actually. this was in typescript, btw. not sure if i'll keep the subscription or not
Ms-J 15 hours ago [-]
[flagged]
rednb 15 hours ago [-]
I appreciate that it's not working for your use case but it's unfortunate that you dismiss the experience of others. And i am not chinese, I am European. Thanks for your feedback anyway.
Ms-J 14 hours ago [-]
[dead]
dahrkael 14 hours ago [-]
I tried Gemini 3.1 pro once to implement a previously designed 7-phase plan.
it only implemented a quarter of the plan before stopping, the code didnt even compile because half of the scaffolding was missing. it then confidently said everything was done.
Codex and GLM didnt have any issue following the exact same plan and getting a working app. So I would argue Gemini is the failure here.
nsonha 14 hours ago [-]
Sounds like you two are taking pass each other. PDF work is a specific niche that according to you it fails, the other person say it's good at coding.
Ms-J 14 hours ago [-]
Scroll down to my other comment, I've used it specifically for coding as well.
"It couldn't even debug some moderately complicated python scripts reliably."
WhitneyLand 8 hours ago [-]
“GLM5…better than Opus, Codex, Gemini…”
What wild claim to make. Unsupported by benchmarks, unsupported by the consensus of the community, no evidence provided.
Sounds like in another comment here even the GLM5 team concedes they are behind the frontier wrt tool calling, do you know something they don’t?
rednb 7 hours ago [-]
I know my use case and my personal experience :) i am not trying to pretend that it is the best in benchmarks, just sharing my experience so people know that some folks are having a very good experience with GLM models, compared to the competition.
My only goal is to encourage people to try it out so they can see if it moves the needle for them, because there are fair chances that it will. I am not trying to start a flamewar or something.
elzbardico 11 minutes ago [-]
FWIW, my experience is the same. Paired with opencode it has been excellent to me.
WhitneyLand 3 hours ago [-]
It’s not a flame war, and you’re not just sharing your experience and encouraging others to try it out.
You’re making a claim, and I’m pointing out that it’s unsubstantiated and not consistent with any other source of data, including that internal to the company that makes the model.
I hope you can see that that’s different than saying it’s worked well for me
rednb 2 hours ago [-]
Sometimes we STEM folks are way too rigid, I obviously meant "IN MY OPINION, GLM models are at this point superior to...".
I do not think that anyone who read my comment understood it differently. But I grant you this point, this is just my opinion based on my personal experience not the result of a scientific study.
Once this is said, i wasn't submitting a scientific paper for preprint, just posting my opinion on an internet forum.
Not sure why you are making such a big deal out of it, especially for something for which people can decide within minutes if it works for them or not. And I haven't seen you nitpick on other people saying that all Chinese models are garbage incapable of doing even the most basic task, without quoting any study. This kind of scrutiny tends to be one-sided.
Edit: and regarding what the z.ai team is saying about their models, just check their Discord and the articles they link there. They themselves say that their latest models have leading performance on a number of aspects. It is misleading to suggest that the authors of the model are not proudly saying that their models have best in class performance.
adrian_b 13 hours ago [-]
I do not know if it is good, because I have not tested it yet, but the most recent uncensored model is:
which was produced immediately after Google released their new Gemma 4 model.
ra 12 hours ago [-]
I still use GLM 4.7 for well defined coding tasks. I never got 5.0 to work satisfactorily, it felt like a hosting problem (z.ai) where it would work for a while then, for whatever reason, it couldn't respond to the context any more - but that's just a hunch.
I had no such trouble with 4.7 and find it fast and productive. Haven't tried 5.1; am using openAI models for coding most of the time.
knbknb 8 hours ago [-]
Same here.
Z.ai seem to promote 4.7 for smaller tasks, 5.1 for larger tasks (similar to Anthropic's recommendation for usage of Haiku and Sonnet/Opus models).
5.1 works for me already in the most economical basic paid tier ("lite coding plan"), unlike first release of v5 (5.0 ?)
stavros 11 hours ago [-]
I hit this as well. It just seems to hang and process for ages.
alexfortin 9 hours ago [-]
Try lowering thinking level with GLM-5.1, to me that seems to have an impact on mitigating the blocking behaviour.
stavros 8 hours ago [-]
Hmm I'll try that, but OpenCode shows me the thinking and it's not even doing that. I'm just getting no tokens from it at all.
uvu 15 hours ago [-]
Completely agree with this statement "Z.ai and their GLM models are pretty low quality." I have been trying out and it's kind of useless compare to SOTA models.
Ms-J 15 hours ago [-]
[flagged]
victorbjorklund 12 hours ago [-]
I don't agree. I think their models are pretty good. The company's infrastructure though seems to be so so.
chimpanzee2 9 hours ago [-]
From what I gather qwen is currently the undisputed local LLM king.
orbital-decay 12 hours ago [-]
>by the way, which is the best uncensored model at the moment?
There are no such models, depending on your definition of censorship. If you're referring to abliteration and similar automated techniques, they're snake oil.
CamperBob2 5 hours ago [-]
That is absolutely not the case. Try HauHauCS's Qwen 3.5 models. They don't refuse anything, and they don't lose a noticeable amount of capability.
Surely at this point it’s part of the training set and the benchmark has lost its value?
Marciplan 5 hours ago [-]
these comments are as useless as simon posting his pelicans
ipsum2 23 hours ago [-]
It made it realistic. A pelican is much more likely to be flying in the sky than riding a bicycle.
_pdp_ 22 hours ago [-]
Simon, you need to come up with improved benchmarks soon.
lemonish97 22 hours ago [-]
Agree. But you can keep the pelican theme in whatever new benchmark you choose to come up with. Iconic at this point.
fancy_pantser 20 hours ago [-]
let me see Tayne with a hat wobble
Yukonv 1 days ago [-]
Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB with the 754B parameters. This is definitely a model your average local LLM enthusiast is not going to be able to run even with high end hardware.
SSD offload is always a possibility with good software support. Of course you might easily object that the model would not be "running" then, more like crawling. Still you'd be able to execute it locally and get it to respond after some time.
Meanwhile we're even seeing emerging 'engram' and 'inner-layer embedding parameters' techniques where the possibility of SSD offload is planned for in advance when developing the architecture.
adrian_b 1 days ago [-]
For conversational purposes that may be too slow, but as a coding assistant this should work, especially if many tasks are batched, so that they may progress simultaneously through a single pass over the SSD data.
QuantumNomad_ 1 days ago [-]
Three hour coffee break while the LLM prepares scaffolding for the project.
pbhjpbhj 23 hours ago [-]
Like computing used to be. When I first compiled a Linux kernel it ran overnight on a Pentium-S. I had little idea what I was doing, probably compiled all the modules by mistake.
stingraycharles 18 hours ago [-]
I remember that time, where compiling Linux kernels was measured in hours. Then multi-core computing arrived, and after a few years it was down to 10 minutes.
With LLMs it feels more like the old punchcards, though.
drowsspa 22 hours ago [-]
At least the compiler was free
adrian_b 13 hours ago [-]
The point of doing local inference with huge models stored on an SSD is to do it free, even if slow.
tempoponet 8 hours ago [-]
Rather, Imagine you have 2-3 of these working 24/7 on top of what you're doing today. What does your backlog look like a month from now?
cyanydeez 24 hours ago [-]
[flagged]
dcreater 23 hours ago [-]
@dang
zozbot234 1 days ago [-]
Batching many disparate tasks together is good for compute efficiency, but makes it harder to keep the full KV-cache for each in RAM. You could handle this in an emergency by dumping some of that KV-cache to storage (this is how prompt caching works too, AIUI) and offloading loads for that too, but that adds a lot more overhead compared to just offloading sparsely-used experts, since KV-cache is far more heavily accessed.
alex7o 1 days ago [-]
To be honest I am a bit sad as, glm5.1 is producing mich better typescript than opus or codex imo, but no matter what it does sometimes go into shizo mode at some point over longer contexts. Not always tho I have had multiple session go over 200k and be fine.
InsideOutSanta 1 days ago [-]
I just set the context window to 100k and manage it actively (e.g. I compact it regularly or make it write out documentation of its current state and start a new session).
For me, Opus 4.6 isn't working quite right currently, and I often use GLM 5.1 instead. I'd prefer to use peak Opus over GLM 5.1, but GLM 5.1 is an adequate fallback. It's incredible how good open-weight models have gotten.
disiplus 1 days ago [-]
When it works and its not slow it can impress. Like yesterday it solved something that kimi k2.5 could not. and kimi was best open source model for me. But it still slow sometimes. I have z.ai and kimi subscription when i run out of tokens for claude (max) and codex(plus).
i have a feeling its nearing opus 4.5 level if they could fix it getting crazy after like 100k tokens.
DeathArrow 15 hours ago [-]
Why don't you start a new session or use the /compact command when context gets to 100k tokens?
From my testing it was ok until 145k tokens, the largest context I had before switching to a new session. I think Z.ai officially said it should be good until 200k tokens.
Using it in Open Code is compacting the context automatically when it gets too large.
MegagramEnjoyer 1 days ago [-]
Why is that sad? A free and open source model outperforming their closed source counterparts is always a win for the users
KaoruAoiShiho 1 days ago [-]
The non-awesome context window is the sad part, but I think a better harness can deal with this.
cmrdporcupine 1 days ago [-]
I honestly still hold onto habits from earlier days of Claude & Codex usage and tend to wipe / compact my context frequently. I don't trust the era of big giant contexts, frankly, even on the frontier models.
calgoo 1 days ago [-]
I also feel like its helping me on the big models these days with claude giving so many issues.
DeathArrow 1 days ago [-]
After the context gets to 100k tokens you should open a new session or run /compact.
csomar 16 hours ago [-]
I've set max context to 180k and usually compact around 120k. It's much better to re-read stuff than to have it under-perform when it's over 120k.
varispeed 1 days ago [-]
Isn't the same with opus nowadays?
dvt 21 hours ago [-]
Every single day, three things are becoming more and more clear:
(1) OpenAI & Anthropic are absolutely cooked; it's obvious they have no moat
(2) Local/private inference is the future of AI
(3) There's *still* no killer product yet (so get to work!)
bottlepalm 18 hours ago [-]
This has got to be bait..
1) OpenAI and Anthropic are killing it, and continue to do so, their coding tools are unmatched for professionals.
2) Local models don't hold a candle to SOTA models and there's nothing on the horizon that indicates that consumers will be able to run anything close to what you can get in a data center.
3) Coding is a killer product, OpenAI and Anthropic are raking in the cash. The top 3 apps are apps in the app store are AI. Everyone who knows anything is using AI, every day, across the economy.
svcrunch 17 hours ago [-]
The grandparent is definitely wrong on (3). Yes, coding is a killer product, I agree with you.
On (2), I agree with you for local models. BUT, there are also the open source Chinese models accessible via open-router. Your argument ("don't hold a candle to SOTA models") does not hold if the comparison is between those.
On (1), I agree more with the grandparent than with your assessment. Yes, OpenAI and Anthropic are killing it for now, but the time horizon is very short. I use codex and claude daily, but it's also clear to me that open source is catching up quickly, both w.r.t. the models and the agentic harnesses.
geysersam 5 hours ago [-]
Open models are good but if you need a $10k GPU to run them then 99% of people are better of subscribing to OAI or CC.
Nowadays I also feel model performance matters less than the design of the tool harness, inference speed, and the other systems that surround a typical coding model.
DeathArrow 16 hours ago [-]
>BUT, there are also the open source Chinese models accessible via open-router.
I thought so myself, but after burning a lot of money on OpenRouter in a few days I just subscribed to Z.ai's Coding Pro plan and using the subscription is much, much friendlier with my wallet.
itake 17 hours ago [-]
> the open source Chinese models accessible via open-router
And? They aren't as good as SOTA models. Even the SOTA model provider's small models aren't worth using for many of my coding tasks.
DeathArrow 16 hours ago [-]
In my limited experience with it, GLM 5.1 is on par with Opus 4.6.
naasking 5 hours ago [-]
I used GLM5 quite a bit, and I'd say it was maybe on par with Sonnet for most simple to medium tasks. Definitely not Opus though. Didn't test super long context tasks, and that's where I would expect it to break down. A recent study on software maintainability still showed Sonnet and Opus were peerless on that metric, although GLM series of models has been making impressive gains.
dvt 17 hours ago [-]
I don't want to respond to 100 comments about the same thing, and this one happens to be on top, so, in my humble opinion:
(1): You don't have to be an Ed Zitron disciple to infer that OpenAI and Anthropic are likely overvalued and that Nvidia is selling everyone shovels in a gold rush. AI is a game-changing technology, but a shitty chat interface does not a company make. OpenAI and Anthropic need to recoup astronomical costs used in training these models. Models that are now being distilled[1] and are quickly becoming commoditized. (And frankly, models that were trained by torrenting copyrighted data[2], anyway.) Many have been calling this out for years: the model cannot be your product. And to be clear, OpenAI/Anthropic most definitely know this: that's why they've been aquihiring like crazy, trying to find that one team that will make the thing.
(2): Token prices are significantly subsidized and anyone that does any serious work with AI can tell you this. Go use an almost-SOTA model (a big Deepseek or Qwen model) offered by many bare-metal providers and you'll see what "true" token prices should look like. The end-state here is likely some models running locally and some running in the cloud. But the current state of OpenClaw token-vomit on top of Claude is fiscally untenable (in fact, this is why Anthropic shut it down).
(3): This is typical Dropbox HN snark[3], of which I am also often guilty of. I really don't think AI coding is a killer product and this seems very myopic—engineers are an extreme minority. Imo, the closest we've seen to something revolutionary is OpenClaw, but it's janky, hard to set up, full of vulnerabilities, and you need to buy a separate computer. But there's certainly a spark there. (And that's personally the vertical I'm focusing on.)
> And to be clear, OpenAI/Anthropic most definitely know this: that's why they've been aquihiring like crazy, trying to find that one team that will make the thing.
Anthropic is up to $30B annual recurring revenue. I wish I had failing business models like that.
> Token prices are significantly subsidized and anyone that does any serious work with AI can tell you this. Go use an almost-SOTA model (a big Deepseek or Qwen model) offered by many bare-metal providers and you'll see what "true" token prices should look like.
I'm not sure what think you are saying here, but if you look at the providers for both "almost-SOTA model (a big Deepseek or Qwen model)" or at the price for Claude on AWS Bedrock, Azure or on GCP you will quickly see inference is very profitable.
monooso 11 hours ago [-]
> Anthropic is up to $30B annual recurring revenue. I wish I had failing business models like that.
And profit? A company can have $300B annual revenue, and still be a failing business if it's making a loss.
Somewhere along the line we seem to have forgotten this basic fact. Eventually there will be no more rounds of funding to feed the fire.
nl 5 hours ago [-]
Anthropic has raised $64B in total since they were founded.
Even if you say we are going to measure profit in the very special hacker news way of looking at money taken in from customer revenue against money invested and we say they can't do things like counting building data centers or buying GPUs as capital expenses and instead have to count them against profit then in 2 years time they will have made more money than they have taken in investment.
That is extraordinary.
tempaccount420 9 hours ago [-]
Costs can always be optimized, revenue is much harder to optimize.
ReptileMan 10 hours ago [-]
It is easy to get 30B when you resell something you buy for 50B
usef- 6 hours ago [-]
The proverbial "50B" is investment in next year's model. The current model cost under "30B", and therefore "is profitable". It is a bet on scaling, yes, but that's been common throughout the industry (see, eg, Amazon not being profitable for many years but building infrastructure)
ReptileMan 6 hours ago [-]
Except the rumors are they subsidize even the inference, not that they have capex in training.
nl 5 hours ago [-]
The maths shows inference is very profitable. Look at how Google/AWS/Azure change the same rates as Anthropic does for running Claude models.
stavros 9 hours ago [-]
> Go use an almost-SOTA model (a big Deepseek or Qwen model) offered by many bare-metal providers and you'll see what "true" token prices should look like.
Qwen3.5-122B-A10B is $0.26 input, $2.08 output. Where's the subsidy? It's ten times cheaper than Opus. Or did you mean that we're subsidizing their training? But then "OpenClaw token-vomit on top of Claude is fiscally untenable" makes no sense.
Yeah, I don't know where you got your costs from. Bare metal providers are significantly cheaper than Anthropic.
usef- 6 hours ago [-]
Maybe he's comparing the renting price of a bare metal server on its own, and doesn't realise how drastically cheaper they are to batch together for an API provider.
6 hours ago [-]
jimmaswell 21 hours ago [-]
No killer product? Coding assistants and LLM's in general are the single most awe-inspiring achievement of humanity in my lifetime, technological or otherwise. They've already massively improved my and others' lives and they're only going to get better. If pre and post industrial revolution used to be the major binary delineation of our history, I'm fairly confident it will soon be seen as pre and post AI instead.
pdntspa 20 hours ago [-]
I know right? 8-year-old me dreamed of being able to articulate software to a computer without having to write code. It (along with the original Stable Diffusion) are Definitely one of the coolest inventions to ever come along in my lifetime
zozbot234 19 hours ago [-]
Coding assistants are currently quite hard to run locally with anything like SOTA abilities. Support in the most popular local inference frameworks is still extremely half-baked (e.g. no seamless offload for larger-than-RAM models; no support for tensor-parallel inference across multiple GPUs, or multiple interconnected machines) and until that improves reliably it's hard to propose spending money on uber-expensive hardware one might be unable to use effectively.
dash2 18 hours ago [-]
This is an argument against the grandparent's points (1) and (2), not their point (3).
zozbot234 18 hours ago [-]
It's one clear argument for the (so get to work!) part.
sunir 18 hours ago [-]
Computers get better and cheaper. That’s not a forever problem.
doctorwho42 16 hours ago [-]
Source?
GPU and RAM prices have definitely not made consumer PC's cheaper than they were before bitcoin blew up or before AI blew up.
Maybe you could make an argument that they are more cost efficient for the price point... But that's not the same as cheaper when every application or program is poorly optimized. For example why would a browser take up more than a GB or two of RAM?
And I'd postulate that R&D to develop localized AI is another example, the big players seem hellbent that there needs to be a most and it's data centers... The absolute opposite of optimization
sunir 4 hours ago [-]
Moore's Law.
We've had RAM shocks before. We nerds can't control the Wall Street or Virginians who like to break the world every so often for the lulz. However, a wobble on the curve doesn't change the curve's destination.
0x457 3 hours ago [-]
You have to look a bit more long term? 256Mb of what today is slow af RAM used to be pretty pricey. Price will pullback.
bitexploder 20 hours ago [-]
No killer products... just robots that can do vulnerability analysis at the level of a decent security engineer and write code without tiring.
allan_s 20 hours ago [-]
I've also been using the LLM in Posthog and it has been impressive. I need to check if I can also plug a MCP/Skill to my actual claude code so that I can cross reference the data from my other data source (stripe, local database, access logs etc.) for in depth analysis
chris_ivester 19 hours ago [-]
This might be up your alley - have Posthog and a ton of other SaaS tools connected so you can run analysis across quant/qualitative data sources: https://dialog.tools
Covenant0028 14 hours ago [-]
> Coding assistants and LLM's in general are the single most awe-inspiring achievement of humanity in my lifetime
Landing a man on the moon is way more impressive. Finding several vaccines for a once in a century pandemic within a year of its outbreak is and achievement that in its impact and importance dwarfs what the entire LLM industry put together has achieved. The near-complete eradication of polio, once again, way more important and impactful.
jimmaswell 11 hours ago [-]
Those are all good things, but with the current AI boom we've invented something with the potential to invent those kinds of things on its own, if not now then in the near future. It's far more important and impactful to invent a digital mind that can invent an arbitrary number of vaccines than to just invent one vaccine, no matter how hard it was to invent the vaccine by hand.
rimliu 12 hours ago [-]
yeah, painting yourself into a corner at 10x speed is hardly the most awe-inspiring achievement of humanity.
grafmax 20 hours ago [-]
> no moat
I'd like to think the superior product wins. But Windows still thrives despite widespread Linux availability. I think sometimes we can underestimate the resilience of the tech oligopolies, particularly when they're VC-funded.
jjfoooo4 19 hours ago [-]
VC can spend all the money in the world and it won't matter if the cost of switching providers is effectively zero.
If I want to switch from Windows to Linux, I have to reconsider a whole variety of applications, learn a different UX, migrate data, all sorts of annoyances.
When I switch between Codex and Claude Code, there is literally no difference in how I interact with them. They and a number of other competitors are drop in replacements for each other.
AlienRobot 18 hours ago [-]
>I'd like to think the superior product wins. But Windows still thrives despite widespread Linux availability.
That's because by most metrics Linux is inferior is Windows.
eldenring 20 hours ago [-]
I don't see how its possible to think this. AI coding assistants are some of the most useful technologies ever created, and model quality is by far the most important thing, so I doesn't make sense why local inference would be the path forward unless something fundamentally changes about hardware.
sunir 18 hours ago [-]
The hardware will change. We know that.
6 hours ago [-]
kcb 21 hours ago [-]
What benefit is there to dropping $50k on GPUs to run this personally besides being a cool enthusiast project?
deminature 20 hours ago [-]
Intel has just released a high VRAM card which allows you to have 128GB of VRAM for $4k. The prices are dropping rapidly. The local models aren't adapted to work on this setup yet, so performance is disappointing. But highly capable local models are becoming increasingly realistic. https://www.youtube.com/watch?v=RcIWhm16ouQ
kcb 18 hours ago [-]
That's 4 32GB GPUs with 600GB/s bw each. This model is not running on that scale GPUs. I think something like 96GB RTX PRO 6000 Blackwells would be the minimum to run a model of this size with performance in the range of subscription models.
acchow 16 hours ago [-]
> I think something like 96GB RTX PRO 6000 Blackwells would be the minimum to run a model of this size with performance in the range of subscription models.
GLM 5.1 has 754B parameters tho. And you still need RAM for context too. You'll want much more than 96GB ram.
marcus_holmes 19 hours ago [-]
Why would anyone need more than 640Kb of memory?
kcb 18 hours ago [-]
Exactly the point though. In the 640KB days there was no subscription to ever increasing compute resources as an alternative.
marcus_holmes 18 hours ago [-]
Well, there kinda was - most computing then was done on mainframes. Personal / Micro computers were seen as a hobby or toy that didn't need any "serious" amounts of memory. And then they ate the world and mainframes became sidelined into a specific niche only used by large institutions because legacy.
I can totally see the same happening here; on-device LLMs are a toy, and then they eat the world and everyone has their own personal LLM running on their own device and the cloud LLMs are a niche used by large institutions.
kcb 18 hours ago [-]
The difference is computers post text terminal are latency and throughput dependent to the user. LLMs are not particularly.
marcus_holmes 18 hours ago [-]
Sorry, I don't understand that comment. Can you clarify, please?
kcb 17 hours ago [-]
My point is LLMs aren't more usable if the hardware is in your room versus a few states away. Personal computers still to this day aren't great when the hardware is fully remote.
marcus_holmes 17 hours ago [-]
Agreed. But you couldn't do much on a PC when they launched, at least compared to a mainframe. The hardware was slow, the memory was limited, there was no networking at all, etc. If you wanted to do any actual serious computing, you couldn't do that on a PC. And yet they ate the world.
I can easily see the advantage, even now, of running the LLM locally. As others have said in this topic. I think it'll happen.
edit: thanks for clarifying :)
blizdiddy 20 hours ago [-]
Is it so hard to project out a couple product cycles? Computers get better. We’ve gone from $50k workstation to commodity hardware before several times
kcb 18 hours ago [-]
Subscription services get all the same benefits from computer hardware getting better. But actually due to scale, batching, resource utilization, they'll always be able to take more advantage of that.
CamperBob2 19 hours ago [-]
It will run exactly the same tomorrow, and the next day, and the day after that, and 10 years from now. It will be just as smart as the day you downloaded the weights. It won't stop working, exhaust your token quota, or get any worse.
That's a valuable guarantee. So valuable, in fact, that you won't get it from Anthropic, OpenAI, or Google at any price.
kcb 17 hours ago [-]
That's why we all still use our e machines its never obsolete PCs. Works just the same it did 20 years ago, though probably not because I've never heard of hardware that's guaranteed not to fail.
fwipsy 19 hours ago [-]
Agree directionally but you don't need $50k. $5k is plenty, $2-3k arguably the sweet spot.
unlikelytomato 19 hours ago [-]
as a local LLM novice, do you have any recommended reading to bootstrap me on selecting hardware? It has been quite confusing bring a latecomer to this game. Googling yields me a lot of outdated info.
fwipsy 17 hours ago [-]
First answer: If you haven't, give it a shot on whatever you already have. MoE models like Qwen3 and GPT-OSS are good on low-end hardware. My RTX 4060 can run qwen3:30b at a comfortable reading pace even though 2/3 of it spills over into system RAM. Even on an 8-year-old tiny PC with 32gb it's still usable.
Second answer: ask an AI, but prices have risen dramatically since their training cutoff, so be sure to get them to check current prices.
Third answer: I'm not an expert by a long shot, but I like building my own PCs. If I were to upgrade, I would buy one of these:
Framework desktop with 128gb for $3k or mainboard-only for $2700 (could just swap it into my gaming PC.) Or any other Strix Halo (ryzen AI 385 and above) mini PC with 64/96/128gb; more is better of course. Most integrated GPUs are constrained by memory bandwidth. Strix Halo has a wider memory bus and so it's a good way to get lots of high-bandwidth shared system/video RAM for relatively cheap. 380=40%; 385=80%; 395=100% GPU power.
I was also considering doing a much hackier build with 2x Tesla P100s (16gb HBM2 each for about $90 each) in a precision 5820 (cheap with lots of space and power for GPUs.) Total about $500 for 32gb HBM2+32gb system RAM but it's all 10-year-old used parts, need to DIY fan setup for the GPUs, and software support is very spotty. Definitely a tinker project; here there be dragons.
terbo 14 hours ago [-]
Agree on the framework, last week you could get a strix halo for $2700 shipped now it's over $3500, find a deal on a NVME and the framework with the noctua is probably going to be the quietest, some of them are pretty loud and hot.
I run qwen 122b with Claude code and nanoclaw, it's pretty decent but this stuff is nowhere prime time ready, but super fun to tinker with. I have to keep updating drivers and see speed increases and stability being worked on. I can even run much larger models with llama.cpp (--fit on) like qwen 397b and I suppose any larger model like GLM, it's slow but smart.
kcb 18 hours ago [-]
The 4-bit quants are 350GB, what hardware are you talking about?
fwipsy 17 hours ago [-]
qwen3:0.6b is 523mb, what model are you talking about? You seem to have a specific one in mind but the parent comment doesn't mention any.
For a hobby/enthusiast product, and even for some useful local tasks, MoE models run fine on gaming PCs or even older midrange PCs. For dedicated AI hardware I was thinking of Strix Halo - with 128gb is currently $2-3k. None of this will replace a Claude subscription.
0x457 3 hours ago [-]
> qwen3:0.6b is 523mb, what model are you talking about?
1) What are you going to use that for? 0.6 model gives you what you could get from Siri when it first launched at most unless you do some tunning.
2) Pretty clear that they are talking about GLM-5.1 4-bit quant.
Glaklloo 12 hours ago [-]
Google doesn't release Gemma 4 if Gemini is similiar good.
We probably talk abuot a year of progress diffeerence.
Its also still quite expensive for an avg person to consume any of it. Either due to hardware invest, energy cost or API cost.
Also professionally I don't think anyone will really spend a little bit less money of having the 3th quality model running if they can run the best model.
I'm happy that we reach levels were this becomes an alternative if you value open and control though.
hodgehog11 19 hours ago [-]
(1) is absolutely not true if you actually use these models on a regular basis and include Google in here too. The difference in reliability beyond basic tasks is night and day. Their reward function is just so much better, and there are many nuanced reasons for this.
(2) is probably true but with caveats. Top-tier models will never run on desktop machines, but companies should (and do) host their own models. The future is open-weight though, that much is for sure.
(3) This is so ignorant that others have already responded to it. Look outside of your own bubble, please.
neonstatic 19 hours ago [-]
> Top-tier models will never run on desktop machines
Sorry, but you don't know that
Yiin 18 hours ago [-]
I mean it's not hard to understand that if good model can run on consumer hardware, even better models can run in data centers
francasso 13 hours ago [-]
If we get to the point where a local model can reliably do the coding for a good majority of cases, then the economic landscape changes significantly. And we are not that far from having big open weight models that can do that, which is a first step
neonstatic 16 hours ago [-]
Larger, yes, absolutely. Better? Right now it seems that bigger is better, but if we are thinking about long term future, it's not obvious that there isn't a point of diminishing returns with regards to size. I can also imagine a breakthrough, where models become much smaller, with the same or better capabilities as the current, very large ones.
hodgehog11 14 hours ago [-]
You are always going to get the same scaling laws in model size regardless of what else you do, so the same degree of improvement seen now relative to the smaller models will be achievable in the future. Yes, small models may be on par with previous generation large models, but the same is true for processors and you don't see supercomputers going away. It's the same principle.
AlienRobot 18 hours ago [-]
I was trying to use Claude.ai today to learn how to do hexagonal geometry.
Every time I asked a question it generated an interactive geometry graph on the fly in Javascript. Sometimes it spent minutes compiling and testing code on the server so it could make sure it was correct. I was really impressed.
Anyway I couldn't really learn anything since when the code didn't work I wasn't sure if I had ported it wrong or the AI did it wrong, so I ended up learning how to calculate SDF and pixel to hex grid from tutorials I found on google instead.
jurschreuder 15 hours ago [-]
This is also my exact experience
mgfist 20 hours ago [-]
Posted this after mythos came out? The hutzpah
fwipsy 19 hours ago [-]
No moat: yes. Cooked: no. It's a race. Why assume they're going to lose? It relies on (2) which is only true if AI usefulness plateaus at some level of compute. That's a huge claim to be making at this stage.
(3) AI has lots of killer products already. The big one is filling in moats. Unrealized potential though for sure.
DeathArrow 16 hours ago [-]
>(1) OpenAI & Anthropic are absolutely cooked; it's obvious they have no moat
I think big corporations will continue to use them no matter how cheap and good other models are. There's a saying: nobody was fired for buying IBM.
IncreasePosts 18 hours ago [-]
How good would open source models be if they couldn't distill higher quality private models?
anon291 20 hours ago [-]
(3) is simply a lie spread by engineers who have no other context. I manage some real estate (mid-term rentals) and everyone I know has switched over to AI robo-handlers to do the contact at this point. It's almost a passive investment at this point. Some can even handle interfacing with contractors and service requests for you. Revolutionized the field in my opinion.
neonstatic 19 hours ago [-]
The model is the killer product
johnfn 1 days ago [-]
GLM-5.0 is the real deal as far as open source models go. In our internal benchmarks it consistently outperforms other open source models, and was on par with things like GPT-5.2. Note that we don't use it for coding - we use it for more fuzzy tasks.
sourcecodeplz 1 days ago [-]
Yep, haven't tried 5.1 but for my PHP coding, GLM-5 is 99% the same as Sonnet/Opus/GPT-5 levels. It is unbelievably strong for what it costs, not to mention you can run it locally.
deepsquirrelnet 1 days ago [-]
I am working on a large scale dataset for producing agent traces for Python <> cython conversion with tooling, and it is second only to gemini pro 3.1 in acceptance rates (16% vs 26%).
Mid-sized models like gpt-oss minimax and qwen3.5 122b are around 6%, and gemma4 31b around 7% (but much slower).
I haven’t tried Opus or ChatGPT due to high costs on openrouter for this application.
foopod 18 hours ago [-]
It really bothers me that people refer to open weight models as being open source. They fundamentally aren't and are more akin to freeware than anything else.
epolanski 1 days ago [-]
Same thing I noticed.
My use cases are not code editing or authoring related, but when it comes to understanding a codebase and it's docs to help stakeholders write tasks or understand systems it has always outperformed american models at roughly half the price.
minimaxir 1 days ago [-]
The focus on the speed of the agent generated code as a measure of model quality is unusual and interesting. I've been focusing on intentionally benchmaxxing agentic projects (e.g. "create benchmarks, get a baseline, then make the benchmarks 1.4x faster or better without cheating the benchmarks or causing any regression in output quality") and Opus 4.6 does it very well: in Rust, it can find enough low-level optimizations to make already-fast Rust code up to 6x faster while still passing all tests.
It's a fun way to quantify the real-world performance between models that's more practical and actionable.
winterqt 1 days ago [-]
Comments here seem to be talking like they've used this model for longer than a few hours -- is this true, or are y'all just sharing your initial thoughts?
KaoruAoiShiho 1 days ago [-]
Blog post is new but the model is about 2 weeks in public.
stavros 1 days ago [-]
My local tennis court's reservation website was broken and I couldn't cancel a reservation, and I asked GLM-5.1 if it can figure out the API. Five minutes later, I check and it had found a /cancel.php URL that accepted an ID but the ID wasn't exposed anywhere, so it found and was exploiting a blind SQL injection vulnerability to find my reservation ID.
Overeager, but I was really really impressed.
disiplus 1 days ago [-]
Yeah it seems they did not align it to much, at least for now. Yesterday it helped me bypass the bot detection on a local marketplace. that i wanted to scrap some listing for my personal alerting system. Al the others failed but glm5.1 found a set of parameters and tweaks how to make my browser in container not be detected.
qingcharles 20 hours ago [-]
I always jump on the Chinese models when I'm trying to do something that the US ones chastise me for, they're a little more liberal, especially around copyright.
ReptileMan 24 hours ago [-]
Model doing what the user wants with high quality is definitely aligned in my book.
wolttam 4 hours ago [-]
This can never go wrong!
smallerize 20 hours ago [-]
It's too much in the direction of the paperclip maxmizer for me. It should only hack sites when explicitly directed to, not as a default.
cesarb 20 hours ago [-]
> Five minutes later, I check and it had found a /cancel.php URL that accepted an ID but the ID wasn't exposed anywhere, so it found and was exploiting a blind SQL injection vulnerability to find my reservation ID.
The (none) version especially shows considerable degradation.
kamranjon 1 days ago [-]
I'm crossing my fingers they release a flash version of this. GLM 4.7 Flash is the main model I use locally for agentic coding work, it's pretty incredible. Didn't find anything in the release about it - but hoping it's on the horizon.
clark1013 12 hours ago [-]
I’ve been using GLM 5.1 instead of GPT 5.4 for a few days now, and it’s working smoothly.
5 hours ago [-]
RickHull 1 days ago [-]
I am on their "Coding Lite" plan, which I got a lot of use out of for a few months, but it has been seriously gimped now. Obvious quantization issues, going in circles, flipping from X to !X, injecting chinese characters. It is useless now for any serious coding work.
unicornfinder 1 days ago [-]
I'm on their pro plan and I respectfully disagree - it's genuinely excellent with GLM 5.1 so long as you remember to /compact once it hits around 100k tokens. At that point it's pretty much broken and entirely unusable, but if you keep context under about 100k it's genuinely on par with Opus for me, and in some ways it's arguably better.
airstrike 1 days ago [-]
100k tokens it's basically nothing these days. Claude Opus 4.6M with 1M context windows is just a different ball game
plandis 1 days ago [-]
Claude Opus can use a 1M context window but I’ve found it to degrade significantly past 250k in practice.
marcus_holmes 19 hours ago [-]
Seconded. I'm getting used to the changes that happen in the conversation now, and can work out when it's time for my little coding buddy to have a nap.
And Opus is absolutely terrible at guessing how many tokens it's used. Having that as a number that the model can access itself would be a real boon.
wild_egg 1 days ago [-]
The Dumb Zone for Opus has always started at 80-100k tokens. The 1M token window just made the dumb zone bigger. Probably fine if the work isn't complicated but really I never want an Opus session to go much beyond 100k.
braebo 1 days ago [-]
The cost per message increases with context while quality decreases so it’s still generally good to practice strategic context engineering. Even with cross-repo changes on enterprise systems, it’s uncommon to need more than 100k (unless I’m using playwright mcp for testing).
bredren 1 days ago [-]
I had thought this, but my experience initially was that performance degradation began getting noticeable not long after crossing the old 250k barrier.
So, it has been convenient to not have hard stops / allow for extra but I still try to /clear at an actual 25% of the 1M anyhow.
This is in contrast to my use of the 1M opus model this past fall over the API, which seemed to perform more steadily.
syntaxing 1 days ago [-]
I’m genuinely surprised. I use copilot at work which is capped at 128K regardless of model and it’s a monorepo. Admittedly I know our code base really well so I can point towards different things quickly directly but I don’t think I ever needed compacting more than a handful in the past year. Let alone 1M tokens.
arcanemachiner 1 days ago [-]
Personal opinions follow:
Claude Opus at 150K context starts getting dumber and dumber.
Claude Opus at 200K+ is mentally retarded. Abandon hope and start wrapping up the session.
operatingthetan 1 days ago [-]
The context windows of these Chinese open-source subscriptions (GLM, Minimax, Kimi) is too small and I'm guessing it's because they are trying to keep them cheap to run. Fine for openclaw, not so much for coding.
thawab 1 days ago [-]
Don’t want to disappoint you, but above 200k opus memory is like a gold fish. You need to be below 150k to get good research and implementation.
arcanemachiner 1 days ago [-]
Oh nice, I just wrote pretty much the same comment above yours.
epolanski 1 days ago [-]
Quality degrades fast with context length for all models.
If you want quality you still have to compact or start new contextes often.
kay_o 1 days ago [-]
Is manual compation absolutely mandatory ?
DeathArrow 12 hours ago [-]
When using GLM 5.1 in Open Code, compaction was done automatically.
jauntywundrkind 1 days ago [-]
I haven't screenshotted to alas, but it goes from being a perfectly reasonable chatty LLM, to suddenly spewing words and nonsense characters around this threshold, at least for me as a z.ai pro (mid tier) user.
For around a month the limit seemed to be a little over 60k! I was despondent!!
What's worse is that when it launched it was stable across the context window. My (wild) guess is that the model is stable but z.ai is doing something wonky with infrastructure, that they are trying to move from one context window to another or have some kv cache issues or some such, and it doesn't really work. If you fork or cancel in OpenCode there's a chance you see the issue much earlier, which feels like some other kind of hint about kv caching, maybe it not porting well between different shaped systems.
More maliciously minded, this artificial limit also gives them an artificial way to dial in system load. Just not delivering the context window the model has reduces the work of what they have to host?
But to the question: yes compaction is absolutely required. The ai can't even speak it's just a jumbled stream of words and punctuation once this hits. Is manual compaction required? One could find a way to build this into the harness, so no, it's a limitation of our tooling that our tooling doesn't work around the stated context window being (effectively) a lie.
I'd really like to see this improved! At least it's not 60-65k anymore; those were soul crushing weeks, where I felt like my treasured celebrated joyful z.ai plan was now near worthless.
The question is: will this reproduce on other hosts, now that glm-5.1 is released? I expect the issue is going to be z.ai specific, given what I've seen (200k works -> 60k -> 100k context windows working on glm-5.1).
calgoo 1 days ago [-]
I have gone back to having it create a todo.md file and break it into very small tasks. Then i just loop over each task with a clear context, and it works fine. a design.md or similar also helps, but most of the time i just have that all in a README.md file. I was also suspicious around the 100k almost to the token for it to start doing loops etc.
disiplus 1 days ago [-]
basically my expirience as well. Sometimes it can break past 100k and be ok, but mostly it breaks down.
kay_o 1 days ago [-]
I am on the mid tier Coding plan to trying it out for the sake of curiosity.
During off peak hour a simple 3 line CSS change took over 50 minutes and it routinely times out mid-tool and leaves dangling XML and tool uses everywhere, overwriting files badly or patching duplicate lines into files
harias 1 days ago [-]
Off peak for China or US
kay_o 1 days ago [-]
Off peak for China. Off peak times are only in one timezone
InsideOutSanta 1 days ago [-]
My impression is that different users get vastly different service, possibly based on location. I live in Western Europe, and it works perfectly for me. Never had a single timeout or noticeable quality degradation. My brother lives in East Asia, and it's unusable for him. Some days, it just literally does not work, no API calls are successful. Other days, it's slow or seems dumber than it should be.
kay_o 17 hours ago [-]
It's now mid weekday in China timezone.
Starting an hour or two ago GLM's API endpoint is failing 7/8 times for me, my editor is retrying every request with backoff over a dozen times before it succeeds and wildly simple changes are taking over 30 minutes per step.
csomar 16 hours ago [-]
Their distribution operation is very bad right now. The model is pretty decent when it works but they have lots of issues serving the people. That being said, I have had the same problems with Gemini (even worse in the last two weeks) and Claude. So it seems to be the norm in the industry.
satvikpendem 1 days ago [-]
Every model seems that way, going back to even GPT 3 and 4, the company comes out with a very impressive model that then regresses over a few months as the company tries to rein in inference costs through quantization and other methods.
wolttam 1 days ago [-]
This is surprising to me. Maybe because I'm on Pro, and not Lite. I signed up last week and managed to get a ton of good work done with 5.1. I think I did run into the odd quantization quirk, but overall: $30 well spent
Mashimo 1 days ago [-]
I'm also on the lite plan and have been using 5.1 for a few days now. It works fine for me.
But it's all casual side projects.
Edit: I often to /compact at around 100 000 token or switch to a new session. Maybe that is why.
LaurensBER 1 days ago [-]
I'm on their lite plan as well and I've been using it for my OpenClaw. It had some issues but it also one-shotted a very impressive dashboard for my Twitter bookmarks.
For the price this is a pretty damn impressive model.
cmrdporcupine 1 days ago [-]
Is there any advantage to their fixed payment plans at all vs just using this model via 3rd party providers via openrouter, given how relatively cheap they tend to be on a per-token basis?
That's more expensive than other models, but not terrible, and will go down over time, and is far far cheaper than Opus or Sonnet or GPT.
I haven't had any bad luck with DeepInfra in particular with quantization or rate limiting. But I've only heard bad things about people who used z.ai directly.
Lalabadie 21 hours ago [-]
I use GLM 5 Turbo sporadically for a client, and my Openrouter expense might climb over a dollar per day if I insist. At about 20 work days per month it's an easy choice.
csomar 16 hours ago [-]
I have their most expensive plan and it's on-par and sometimes better than Claude although you have to keep context short. That being said, the quota is no longer generous. It's still priced below Claude but not by that much. (compared to a few months ago where your money gets you x10 in tokens)
esafak 1 days ago [-]
I'm on their Lite plan and I see some of this too. It is also slow. I use it as a backup.
benterix 1 days ago [-]
> Obvious quantization issues
Devil's advocate: why shouldn't they do it if OpenAI, Anthropic and Google get away with playing this game?
cmrdporcupine 1 days ago [-]
I think what Anthropic is doing is more subtle. It's less about quantizing and more about depth of thinking. They control it on their end and they're dynamically fiddling with those knobs.
margorczynski 1 days ago [-]
It has been useless for long time when compared to Opus or even something like Kimi. The saving grace was that it was dirt cheap but that doesn't matter if it can't do what I want even after many repeated tries and trying to push it to a correct solution.
I find the "8 hour Linux Desktop" bit disingenuous, in the fine print it's a browser page:
> "build a Linux-style desktop environment as a web application"
They claim "50 applications from scratch", but "Browser" and a bunch of the other apps are likely all <iframe> elements.
We all know that building a spec-compliant browser alone is a herculean task.
MrPowerGamerBR 1 days ago [-]
In my opinion it would be way cooler if it actually created a real Linux desktop environment instead of only a replica.
Would it succeed? Probably not, but it would be way more interesting, even if it didn't work.
I find things like Claude's C compiler way more interesting where, even though CCC is objectively bad (code is messy, generates very bad unoptimized code, etc) it at least is something cool and shows that with some human guideance it could generate something even better.
bredren 1 days ago [-]
It is a big claim without the source and prompting.
bdeol22 11 hours ago [-]
Long-horizon demos are fun; the product test is still interrupted real life—can it pick up three days later without you re-teaching context?
mark_l_watson 1 days ago [-]
I can’t wait to try it. I set up a new system this morning with OpenClaw and GLM-5, and I like GLM-5 as the backend for Claude Code. Excellent results.
Alifatisk 10 hours ago [-]
There is also GLM-5-Turbo, have you tried it for your claw?
8dazo 20 hours ago [-]
Just saw the Claude Mythos post. Not sure when it’s going public, but this feels like a real jump, not just incremental progress. Also waiting for the next GLM release coz specs are looking kind of insane.
zozbot234 19 hours ago [-]
Gemini and GPT have Deep Research models already, Mythos looks like much the same thing.
blazespin 23 hours ago [-]
Anthropic's reply? A model you can't use.
minimaxir 23 hours ago [-]
Mythos is most definitely not in response to this announcement.
tgtweak 1 days ago [-]
Share the harness for that browser linux OS task :)
DeathArrow 1 days ago [-]
I am already subscribed to their GLM Coding Pro monthly plan and working with GLM 5.1 coupled with Open Code is such a pleasure! I will cancel my Cursor subscription.
epolanski 1 days ago [-]
I was very satisfied with GLM5, I'm not gonna lie.
Excited to test this.
EITB_2026 16 hours ago [-]
Good One Though
maxdo 1 days ago [-]
One of the bench maxed models . Every time I tried it , it’s not on par even with other open source models .
wallmountedtv 22 hours ago [-]
Feeling very much the same. Attempting to use it through Claude Code as a model it just completely lost all context on what it was doing after a few months and kept short circuiting even with the most helpful prompts I could give, outside of just writing out the answer myself. I really do not get the praise for this model.
Being "better than Opus 4.6" is not really something a benchmark will tell you. It's much more a consensus of users liking the flavor of an answer, rather than fueling x% correct on a benchmark.
philipwhiuk 20 hours ago [-]
This is the flip side of the Project Glasswing stuff...
Everyone else isn't that far behind and they aren't all gonna just wall off their new model.
A reason that Anthropic will eventually give is 'the competition can do what Glasswing can do so what's the point limiting it'.
Is there really no rule that discourages 99% of your interactions with HN from being peddling some useless slop benchmark?
XCSme 12 hours ago [-]
If it's relevant to the discussion, I hope not.
I've spent probably over100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others that reached out to me) are not useless either. I use this myself regularly when choosing and comparing new models. I honestly beleive it is providing value to the conversation.
Let me know if you know of a better platform you can use to compare models, I built this one because I didn't find any with good enough UX.
jaggs 10 hours ago [-]
It's a great benchmark. Don't listen to the haters. This one is especially interesting.
Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is in terms of getting anything serious done, compared to the rock solid opus experience.
eis 1 days ago [-]
The blog post has a benchmark comparison table with these two in it
jaggs 1 days ago [-]
Thanks, I missed that. It's very interesting. They're quite close, but I found Qwen 3.6 plus was just marginally better than Kimi 2.5. But looking at the stats I'll definitely give GLM 5.1 a try now. [edit: even though looking at it, it's not cheap and has a much smaller context size.And I can't tell about tool use.]
DeathArrow 1 days ago [-]
Compared to Kimi 2.5 or Qwen 3.6 Plus I don't know, but I ran GLM 5 (not 5.1) side by side with Qwen 3.5 Plus and it was visibly better.
bigyabai 1 days ago [-]
It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts. When you crest 128k tokens, there's a high chance that the model will start spouting gibberish until you compact the history.
For short-term bugfixing and tweaks though, it does about what I'd expect from Sonnet for a pretty low price.
cassianoleal 1 days ago [-]
I've done some very long sessions on OpenCode with Dynamic Context Pruning. Highly recommend it.
> It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts
Since the entire purpose, focus and motivation of this model seems to have been "coherency over longer contexts", doesn't that issue makes it not an OK model? It's bad at the thing it's supposed to be good at, no?
wolttam 1 days ago [-]
long(er) contexts (than the previous model)
It does devolve into gibberish at long context (~120k+ tokens by my estimation but I haven't properly measured), but this is still by far the best bang-for-buck value model I have used for coding.
It's a fine model
disiplus 1 days ago [-]
i have glm and kimi. kimi was in most of the cases better and my replacement for claude when i run out of tokens. Now im finding myself using glm more then kimi. Its funny that glm vs kimi, is like codex vs claude. Where glm and codex are better for backend and kimi and claude more for frontend.
as kimi did a huge amount of claude distilation it seems to be somewhat based in data
I'm curious how the bang for buck ratio works in comparison. My initial tests for coding tasks have been positive and I can run it at home. Bigger models I assume are still better on harder tasks.
whimblepop 1 days ago [-]
That's pretty few, at least for the way I'm currently using LLMs. I have them do some Nix work (both debugging and coding) where accuracy and quality matters to me, so they're instructed to behave as I would when it comes to docs, always consulting certain docs and source code in a specific order. It's not unusual for them to chew through 200k - 600k tokens in a single session before they solve everything I want them to. That's what I currently think of when I think of "long horizon within a single context window".
So I need them to not only not devolve into gibberish, but remain smart enough to be useful at contexts several times longer than that.
nkko 1 days ago [-]
Yes, this is frustrating, but it doesn’t occur in CC. I run the conversation logs through an agent and opencode source, and it identified an issue in the reasoning implementation of opencode for Zai models. Consequently, I ceased my research and opted to use CC instead.
HumanOstrich 1 days ago [-]
I wonder if running the compaction in a degraded state produces a subpar summary to continue with.
gunalx 24 hours ago [-]
Indeed it does. Once i see degraded state i revert to last task and run a compact, before starting up again.
jauntywundrkind 1 days ago [-]
Chiming in to second this issue. It is wildly frustrating.
I suspect that this isn't the model, but something that z.ai is doing with hosting it. At launch I was related to find glm-5.1 was stable even as the context window filled all the way up (~200k). Where-as glm-5, while it could still talk and think, but had forgotten the finer points of tool use to the point where it was making grevious errors as it went (burning gobs of tokens to fix duplicate code problems).
However, real brutal changes happened sometimes in the last two or three months: the parent problem emerged and emerged hard, out of nowhere. Worse, for me, it seemed to be around 60k context windows, which was heinous: I was honestly a bit despondent that my z.ai subscription had become so effectively useless. That I could only work on small problems.
Thankfully the coherency barrier raised signficiantly around three weeks go. It now seems to lose its mind and emits chaotic non-sentance gibberish around 100k for me. GLM-5 was already getting pretty shaky at this point, so I feel like I at least have some kind of parity. But at least glm-5 was speaking & thinking with real sentances, I could keep conversing with it somewhat, where-as glm-5.1 seems to go from perfectly level headed working fine to all of a sudden just total breakdown, hard switch, at such a predictable context window size.
It seems so so probable to me that this isn't the model that's making this happen: it's the hosting. There's some KV cache issue, or they are trying to expand the context window in some way, or to switch from one serving pool of small context to a big context serving pool, or something infrastructure wise that falls flat and collapses. Seeing the window so clearly change from 200k to 60k to 100k is both hope, but also, misery.
I've been leaving some breadcrumbs on Bluesky as I go. It's been brutal to see. Especially having tasted a working glm-5.1. I don't super want to pay API rates to someone else, but I fully expect this situation to not reproduce on other hosting, and may well spend the money to try and see. https://bsky.app/profile/jauntywk.bsky.social/post/3mhxep7ek...
All such a shame because aside from totally going mad & speaking unpuncutaed gibberish, glm-5.1 is clearly very very good and I trust it enormously.
ummzokbro 24 hours ago [-]
This.
GLM5 also had this issue. When it was free on Openrouter / Kilo the model was rock solid though did degrade after 100k tokens gracefully. Same at launch with Zai aside from regular timeouts.
Somewhere around early-mid March zai did something significant to GLM5 - like KV quanting or model quanting or both.
After that it's been russian roulette. Sometimes it works flawlessly but very often (1/4 or 1/5 of the time) thinking tokens spill into main context and if you don't spot it happening it can do real damage - heavily corrupting files, deleting whole directories.
You can see the pain by visiting the zai discord - filled with reports of the issue yet radio silence by zai.
Tellingly despite being open source not a single provider will sell you access to this model at anything approaching the plans zai offers. The numbers just don't work so your choice is either pay per token significantly more and get reliability or put up with the bait and switch.
throwdbaaway 23 hours ago [-]
https://github.com/THUDM/IndexCache - Might be some expected issue when rolling out this. They don't have enough compute, and have to innovate.
girvo 23 hours ago [-]
This doesn’t help you, but GLM-5 stays coherent far longer on Alibaba’s coding plan/infra. You can’t get that coding plan anymore though unfortunately!
esseph 1 days ago [-]
> "aside from totally going mad & speaking unpuncutaed gibberish [...] I trust it enormously."
The bar is very low :(
jauntywundrkind 1 days ago [-]
I see where you are coming from.
But I used 70m tokens yesterday on glm-5.1 (thanks glm for having good observability of your token usage unlike openai, dunno about anthropic). And got incredible beautiful results that I super trust. It's done amazing work.
This limitation feels very shady and artificial to me, and i don't love this, but I also feel like I'm working somewhat effectively within the constraints. This does put a huge damper on people running more autonomous agentic systems, unless they have Pi or other systems that can more self adaptively improve the harness.
azuanrb 1 days ago [-]
Have you compared it with using Claude Code as the harness? It performs much better than opencode for me.
redoh 4 hours ago [-]
[dead]
claud_ia 10 hours ago [-]
[dead]
Manchitsanan 8 hours ago [-]
[dead]
dang 1 days ago [-]
[stub for offtopicness]
[[you guys, please don't post like this to HN - it will just irritate the community and get you flamed]]
smith7018 1 days ago [-]
Hmm, three spam comments posted within 9 minutes of each other. The accounts were created 15 minutes ago, 51 days ago, and 3 months ago.
Interesting.
Hopefully these aren't bots created by Z.AI because GLM doesn't need fake engagement.
dang 1 days ago [-]
These comments are probably either by friends of the OP or perhaps associated with the project somehow, which is against HN's rules but not the kind of attack we're mostly concerned with these days. Old-fashioned voting rings and booster comments aren't existential threats and actually bring up somewhat nostalgic feelings at the moment!
Thanks for watching out for the quality of HN...
ray__ 1 days ago [-]
Would love to read a Tell HN post about the kinds of attacks you are concerned with!
dang 23 hours ago [-]
For example, there are rings of accounts posting generated comments, presumably in order to build karma for spammy or (let's be kind) promotional reasons. There are also plenty of spam rings that create tons of accounts and whatnot.
These are different from the submitter-passed-a-link-to-friends kind of upvoting and booster comments, which feel quaint by comparison. In this case people usually don't know they are breaking HN's rules, which is why they don't try to hide it.
tadfisher 1 days ago [-]
I moderate a medium-sized development subreddit. The sheer volume of spam advertising some AI SaaS company has skyrocketed over the past few months, like 10000%. Comment spam is now a service you can purchase [0][1], and I would not be surprised if Z.ai engaged some marketing firm which ended up purchasing this service.
There are YC members in the current batch who are spamming us right now [2]. They are all obvious engagement-bait questions which are conveniently answered with references to the SaaS.
Z.ai Discord is filled to the brim with people experiencing capacity issues. I had to cancel my subscription with Z.ai because the service was totally unusable. Their Discord is a graveyard of failures. I switched to Alibaba Cloud for GLM but now they hiked their coding plan to $50 a month which is 2.5x more expensive than ChatGPT Plus. Totally insane.
sourcecodeplz 1 days ago [-]
Everyone has started either hiking their prices or limiting the tokens, gravy train is over. Glad we have open models that we can host; Sad RAM is so expensive..
zendi 1 days ago [-]
[flagged]
louszbd 1 days ago [-]
[flagged]
seven2928 1 days ago [-]
[flagged]
meidad_g 20 hours ago [-]
[dead]
EddyAI 1 days ago [-]
[dead]
aplomb1026 1 days ago [-]
[dead]
aryehof 15 hours ago [-]
[dead]
andrewmcwatters 1 days ago [-]
[dead]
dryarzeg 22 hours ago [-]
A bit off-topic, but for some reason, even though I don't use LLMs for my job or for my hobbies, or in daily life frequently (and when I do, it's mostly some kind of "rubber duck brainstorm"), when I see open-weight releases like this one or the recent Gemma 4 (which is very good for local models); the first time was with DeepSeek-R1 (this one, despite being blamed for "censorship", was heavily censored only via DeepSeek API, the local model - full-weight 685B, not the distilled ones - was pretty much unhinged regarding censorship on any topic)... there's always one song coming to mind and I simply can't get rid of it no matter how hard I try.
"I am the storm that is approaching, provoking..." : )
Rendered at 20:32:24 GMT+0000 (Coordinated Universal Time) with Vercel.
Contrary to the model card, its one-shot performance is more impressive than its agentic abilities. On both metrics, GLM 5.1 is competitive with frontier models.
But keeping in mind this is an open source model operating near the frontier, it's nothing short of incredible.
I suspect 2 issues with the model are keeping it from fully realizing its potential in agentic harnesses: - Context rot (already a common complaint). We are still working on a metric to robustly test and visualize this on the site. - The model was most likely overtrained on standardized toolsets and benchmarks, and isn't as adaptive in using arbitrary tooling in our custom harness simulations. We've decided to commit to measuring intelligence as the ability to use custom, changing tools, instead of being trained to use specific tools (while still always providing a way to run local bash and other common tools). There are arguments to be made for either, but the former is more indicative of general intelligence. Regardless, it's a subtle difference and GLM 5.1 still performs well with tooling in our environments.
Crazy week for open source AI. Gemma 4 has shown that large model density is nowhere near optimized. Moats are shrinking.
If there are more representations of model performance you'd like to see, I'm actively reading your feedback and ideas.
My impression is that the choice of harness matters a lot.
I've been testing it for awhile now since it seemed to have potential as a local model.
With this new update it still cannot parse simple, test PDFs correctly. It inconsistently tells me that the value in the name field in the document is incorrect, and has the name reversed to put the last name first. Or that a date is wrong as it's in the past/future, when it is not. Tons of fundamental errors like that.
Even when looking at the thinking process there are issues:
I used a test website for it to analyze and it says that the sites copyright year states 2026 which is in the future and to investigate as it could be an attack, but right after prints today's correct date.
I'm in the process of trying to get it uncensored. Hopefully that will create some use out of z.ai
Edit: by the way, which is the best uncensored model at the moment?
I also use Claude premium daily for another client, and i use Codex. and i can tell you that GLM5 is at this point much more capable than Claude and Codex for complex backend end work, complex feature planning, and long horizon tasks. One thing i've noticed is that it is particularly good at following instructions and guidelines, even deep into the execution of a plan.
To me the only problem is that z.ai have had trouble with inference : the performance of their API has been pretty poor at times. It looks like this is an hardware issue related to the Huawei chips they use rather than an issue with the model itself. The situation has been substantially improving over the past few weeks.
GLM5.1, GLM5-Turbo and GLM5v are at this point better than Opus, Codex, Gemini and other claude source models. We have reached a major turning point. To me, the only closed source model still in the game is codex as it is much faster at executing simple tasks and implementing already created plans.
Try GLM5v for your PDF work, it's their last generation vision model that has been released a couple of days ago.
>For AI computing, the Atlas 950 SuperPoD, powered by UnifiedBus, integrates 64 NPUs per cabinet and can scale up to 8,192 NPUs, delivering superior performance for large-scale AI training and high-concurrency inference.
Codex and GLM didnt have any issue following the exact same plan and getting a working app. So I would argue Gemini is the failure here.
"It couldn't even debug some moderately complicated python scripts reliably."
What wild claim to make. Unsupported by benchmarks, unsupported by the consensus of the community, no evidence provided.
Sounds like in another comment here even the GLM5 team concedes they are behind the frontier wrt tool calling, do you know something they don’t?
My only goal is to encourage people to try it out so they can see if it moves the needle for them, because there are fair chances that it will. I am not trying to start a flamewar or something.
You’re making a claim, and I’m pointing out that it’s unsubstantiated and not consistent with any other source of data, including that internal to the company that makes the model.
I hope you can see that that’s different than saying it’s worked well for me
I do not think that anyone who read my comment understood it differently. But I grant you this point, this is just my opinion based on my personal experience not the result of a scientific study.
Once this is said, i wasn't submitting a scientific paper for preprint, just posting my opinion on an internet forum.
Not sure why you are making such a big deal out of it, especially for something for which people can decide within minutes if it works for them or not. And I haven't seen you nitpick on other people saying that all Chinese models are garbage incapable of doing even the most basic task, without quoting any study. This kind of scrutiny tends to be one-sided.
Edit: and regarding what the z.ai team is saying about their models, just check their Discord and the articles they link there. They themselves say that their latest models have leading performance on a number of aspects. It is misleading to suggest that the authors of the model are not proudly saying that their models have best in class performance.
https://huggingface.co/trohrbaugh/gemma-4-31b-it-heretic-ara...
which was produced immediately after Google released their new Gemma 4 model.
I had no such trouble with 4.7 and find it fast and productive. Haven't tried 5.1; am using openAI models for coding most of the time.
Z.ai seem to promote 4.7 for smaller tasks, 5.1 for larger tasks (similar to Anthropic's recommendation for usage of Haiku and Sonnet/Opus models).
5.1 works for me already in the most economical basic paid tier ("lite coding plan"), unlike first release of v5 (5.0 ?)
There are no such models, depending on your definition of censorship. If you're referring to abliteration and similar automated techniques, they're snake oil.
[0] https://huggingface.co/unsloth/GLM-5.1-GGUF
Meanwhile we're even seeing emerging 'engram' and 'inner-layer embedding parameters' techniques where the possibility of SSD offload is planned for in advance when developing the architecture.
With LLMs it feels more like the old punchcards, though.
For me, Opus 4.6 isn't working quite right currently, and I often use GLM 5.1 instead. I'd prefer to use peak Opus over GLM 5.1, but GLM 5.1 is an adequate fallback. It's incredible how good open-weight models have gotten.
i have a feeling its nearing opus 4.5 level if they could fix it getting crazy after like 100k tokens.
From my testing it was ok until 145k tokens, the largest context I had before switching to a new session. I think Z.ai officially said it should be good until 200k tokens.
Using it in Open Code is compacting the context automatically when it gets too large.
1) OpenAI and Anthropic are killing it, and continue to do so, their coding tools are unmatched for professionals.
2) Local models don't hold a candle to SOTA models and there's nothing on the horizon that indicates that consumers will be able to run anything close to what you can get in a data center.
3) Coding is a killer product, OpenAI and Anthropic are raking in the cash. The top 3 apps are apps in the app store are AI. Everyone who knows anything is using AI, every day, across the economy.
On (2), I agree with you for local models. BUT, there are also the open source Chinese models accessible via open-router. Your argument ("don't hold a candle to SOTA models") does not hold if the comparison is between those.
On (1), I agree more with the grandparent than with your assessment. Yes, OpenAI and Anthropic are killing it for now, but the time horizon is very short. I use codex and claude daily, but it's also clear to me that open source is catching up quickly, both w.r.t. the models and the agentic harnesses.
Nowadays I also feel model performance matters less than the design of the tool harness, inference speed, and the other systems that surround a typical coding model.
I thought so myself, but after burning a lot of money on OpenRouter in a few days I just subscribed to Z.ai's Coding Pro plan and using the subscription is much, much friendlier with my wallet.
And? They aren't as good as SOTA models. Even the SOTA model provider's small models aren't worth using for many of my coding tasks.
(1): You don't have to be an Ed Zitron disciple to infer that OpenAI and Anthropic are likely overvalued and that Nvidia is selling everyone shovels in a gold rush. AI is a game-changing technology, but a shitty chat interface does not a company make. OpenAI and Anthropic need to recoup astronomical costs used in training these models. Models that are now being distilled[1] and are quickly becoming commoditized. (And frankly, models that were trained by torrenting copyrighted data[2], anyway.) Many have been calling this out for years: the model cannot be your product. And to be clear, OpenAI/Anthropic most definitely know this: that's why they've been aquihiring like crazy, trying to find that one team that will make the thing.
(2): Token prices are significantly subsidized and anyone that does any serious work with AI can tell you this. Go use an almost-SOTA model (a big Deepseek or Qwen model) offered by many bare-metal providers and you'll see what "true" token prices should look like. The end-state here is likely some models running locally and some running in the cloud. But the current state of OpenClaw token-vomit on top of Claude is fiscally untenable (in fact, this is why Anthropic shut it down).
(3): This is typical Dropbox HN snark[3], of which I am also often guilty of. I really don't think AI coding is a killer product and this seems very myopic—engineers are an extreme minority. Imo, the closest we've seen to something revolutionary is OpenClaw, but it's janky, hard to set up, full of vulnerabilities, and you need to buy a separate computer. But there's certainly a spark there. (And that's personally the vertical I'm focusing on.)
[1] https://www.anthropic.com/news/detecting-and-preventing-dist...
[2] https://media.npr.org/assets/artslife/arts/2025/complaint.pd...
[3] https://news.ycombinator.com/item?id=9224
Anthropic is up to $30B annual recurring revenue. I wish I had failing business models like that.
> Token prices are significantly subsidized and anyone that does any serious work with AI can tell you this. Go use an almost-SOTA model (a big Deepseek or Qwen model) offered by many bare-metal providers and you'll see what "true" token prices should look like.
I'm not sure what think you are saying here, but if you look at the providers for both "almost-SOTA model (a big Deepseek or Qwen model)" or at the price for Claude on AWS Bedrock, Azure or on GCP you will quickly see inference is very profitable.
And profit? A company can have $300B annual revenue, and still be a failing business if it's making a loss.
Somewhere along the line we seem to have forgotten this basic fact. Eventually there will be no more rounds of funding to feed the fire.
Even if you say we are going to measure profit in the very special hacker news way of looking at money taken in from customer revenue against money invested and we say they can't do things like counting building data centers or buying GPUs as capital expenses and instead have to count them against profit then in 2 years time they will have made more money than they have taken in investment.
That is extraordinary.
Qwen3.5-122B-A10B is $0.26 input, $2.08 output. Where's the subsidy? It's ten times cheaper than Opus. Or did you mean that we're subsidizing their training? But then "OpenClaw token-vomit on top of Claude is fiscally untenable" makes no sense.
Yeah, I don't know where you got your costs from. Bare metal providers are significantly cheaper than Anthropic.
GPU and RAM prices have definitely not made consumer PC's cheaper than they were before bitcoin blew up or before AI blew up.
Maybe you could make an argument that they are more cost efficient for the price point... But that's not the same as cheaper when every application or program is poorly optimized. For example why would a browser take up more than a GB or two of RAM?
And I'd postulate that R&D to develop localized AI is another example, the big players seem hellbent that there needs to be a most and it's data centers... The absolute opposite of optimization
We've had RAM shocks before. We nerds can't control the Wall Street or Virginians who like to break the world every so often for the lulz. However, a wobble on the curve doesn't change the curve's destination.
Landing a man on the moon is way more impressive. Finding several vaccines for a once in a century pandemic within a year of its outbreak is and achievement that in its impact and importance dwarfs what the entire LLM industry put together has achieved. The near-complete eradication of polio, once again, way more important and impactful.
I'd like to think the superior product wins. But Windows still thrives despite widespread Linux availability. I think sometimes we can underestimate the resilience of the tech oligopolies, particularly when they're VC-funded.
If I want to switch from Windows to Linux, I have to reconsider a whole variety of applications, learn a different UX, migrate data, all sorts of annoyances.
When I switch between Codex and Claude Code, there is literally no difference in how I interact with them. They and a number of other competitors are drop in replacements for each other.
That's because by most metrics Linux is inferior is Windows.
GLM 5.1 has 754B parameters tho. And you still need RAM for context too. You'll want much more than 96GB ram.
I can totally see the same happening here; on-device LLMs are a toy, and then they eat the world and everyone has their own personal LLM running on their own device and the cloud LLMs are a niche used by large institutions.
I can easily see the advantage, even now, of running the LLM locally. As others have said in this topic. I think it'll happen.
edit: thanks for clarifying :)
That's a valuable guarantee. So valuable, in fact, that you won't get it from Anthropic, OpenAI, or Google at any price.
Second answer: ask an AI, but prices have risen dramatically since their training cutoff, so be sure to get them to check current prices.
Third answer: I'm not an expert by a long shot, but I like building my own PCs. If I were to upgrade, I would buy one of these:
Framework desktop with 128gb for $3k or mainboard-only for $2700 (could just swap it into my gaming PC.) Or any other Strix Halo (ryzen AI 385 and above) mini PC with 64/96/128gb; more is better of course. Most integrated GPUs are constrained by memory bandwidth. Strix Halo has a wider memory bus and so it's a good way to get lots of high-bandwidth shared system/video RAM for relatively cheap. 380=40%; 385=80%; 395=100% GPU power.
I was also considering doing a much hackier build with 2x Tesla P100s (16gb HBM2 each for about $90 each) in a precision 5820 (cheap with lots of space and power for GPUs.) Total about $500 for 32gb HBM2+32gb system RAM but it's all 10-year-old used parts, need to DIY fan setup for the GPUs, and software support is very spotty. Definitely a tinker project; here there be dragons.
I run qwen 122b with Claude code and nanoclaw, it's pretty decent but this stuff is nowhere prime time ready, but super fun to tinker with. I have to keep updating drivers and see speed increases and stability being worked on. I can even run much larger models with llama.cpp (--fit on) like qwen 397b and I suppose any larger model like GLM, it's slow but smart.
For a hobby/enthusiast product, and even for some useful local tasks, MoE models run fine on gaming PCs or even older midrange PCs. For dedicated AI hardware I was thinking of Strix Halo - with 128gb is currently $2-3k. None of this will replace a Claude subscription.
1) What are you going to use that for? 0.6 model gives you what you could get from Siri when it first launched at most unless you do some tunning.
2) Pretty clear that they are talking about GLM-5.1 4-bit quant.
We probably talk abuot a year of progress diffeerence.
Its also still quite expensive for an avg person to consume any of it. Either due to hardware invest, energy cost or API cost.
Also professionally I don't think anyone will really spend a little bit less money of having the 3th quality model running if they can run the best model.
I'm happy that we reach levels were this becomes an alternative if you value open and control though.
(2) is probably true but with caveats. Top-tier models will never run on desktop machines, but companies should (and do) host their own models. The future is open-weight though, that much is for sure.
(3) This is so ignorant that others have already responded to it. Look outside of your own bubble, please.
Sorry, but you don't know that
Every time I asked a question it generated an interactive geometry graph on the fly in Javascript. Sometimes it spent minutes compiling and testing code on the server so it could make sure it was correct. I was really impressed.
Anyway I couldn't really learn anything since when the code didn't work I wasn't sure if I had ported it wrong or the AI did it wrong, so I ended up learning how to calculate SDF and pixel to hex grid from tutorials I found on google instead.
I think big corporations will continue to use them no matter how cheap and good other models are. There's a saying: nobody was fired for buying IBM.
Mid-sized models like gpt-oss minimax and qwen3.5 122b are around 6%, and gemma4 31b around 7% (but much slower).
I haven’t tried Opus or ChatGPT due to high costs on openrouter for this application.
My use cases are not code editing or authoring related, but when it comes to understanding a codebase and it's docs to help stakeholders write tasks or understand systems it has always outperformed american models at roughly half the price.
It's a fun way to quantify the real-world performance between models that's more practical and actionable.
Overeager, but I was really really impressed.
xkcd was prescient once again... https://xkcd.com/416/
I think the model is now tuned more towards agentic use/coding than general intelligence.
[0]: https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...
And Opus is absolutely terrible at guessing how many tokens it's used. Having that as a number that the model can access itself would be a real boon.
So, it has been convenient to not have hard stops / allow for extra but I still try to /clear at an actual 25% of the 1M anyhow.
This is in contrast to my use of the 1M opus model this past fall over the API, which seemed to perform more steadily.
Claude Opus at 150K context starts getting dumber and dumber.
Claude Opus at 200K+ is mentally retarded. Abandon hope and start wrapping up the session.
If you want quality you still have to compact or start new contextes often.
For around a month the limit seemed to be a little over 60k! I was despondent!!
What's worse is that when it launched it was stable across the context window. My (wild) guess is that the model is stable but z.ai is doing something wonky with infrastructure, that they are trying to move from one context window to another or have some kv cache issues or some such, and it doesn't really work. If you fork or cancel in OpenCode there's a chance you see the issue much earlier, which feels like some other kind of hint about kv caching, maybe it not porting well between different shaped systems.
More maliciously minded, this artificial limit also gives them an artificial way to dial in system load. Just not delivering the context window the model has reduces the work of what they have to host?
But to the question: yes compaction is absolutely required. The ai can't even speak it's just a jumbled stream of words and punctuation once this hits. Is manual compaction required? One could find a way to build this into the harness, so no, it's a limitation of our tooling that our tooling doesn't work around the stated context window being (effectively) a lie.
I'd really like to see this improved! At least it's not 60-65k anymore; those were soul crushing weeks, where I felt like my treasured celebrated joyful z.ai plan was now near worthless.
There's a thread https://news.ycombinator.com/item?id=47678279 , and I have more extensive history / comments on what I've seen there.
The question is: will this reproduce on other hosts, now that glm-5.1 is released? I expect the issue is going to be z.ai specific, given what I've seen (200k works -> 60k -> 100k context windows working on glm-5.1).
During off peak hour a simple 3 line CSS change took over 50 minutes and it routinely times out mid-tool and leaves dangling XML and tool uses everywhere, overwriting files badly or patching duplicate lines into files
Starting an hour or two ago GLM's API endpoint is failing 7/8 times for me, my editor is retrying every request with backoff over a dozen times before it succeeds and wildly simple changes are taking over 30 minutes per step.
But it's all casual side projects.
Edit: I often to /compact at around 100 000 token or switch to a new session. Maybe that is why.
For the price this is a pretty damn impressive model.
Providers like DeepInfra are already giving access to 5.1 https://deepinfra.com/zai-org/GLM-5.1
$1.40 in $4.40 out $0.26 cached
/ 1M tokens
That's more expensive than other models, but not terrible, and will go down over time, and is far far cheaper than Opus or Sonnet or GPT.
I haven't had any bad luck with DeepInfra in particular with quantization or rate limiting. But I've only heard bad things about people who used z.ai directly.
Devil's advocate: why shouldn't they do it if OpenAI, Anthropic and Google get away with playing this game?
We all know that building a spec-compliant browser alone is a herculean task.
Would it succeed? Probably not, but it would be way more interesting, even if it didn't work.
I find things like Claude's C compiler way more interesting where, even though CCC is objectively bad (code is messy, generates very bad unoptimized code, etc) it at least is something cool and shows that with some human guideance it could generate something even better.
Excited to test this.
Being "better than Opus 4.6" is not really something a benchmark will tell you. It's much more a consensus of users liking the flavor of an answer, rather than fueling x% correct on a benchmark.
Everyone else isn't that far behind and they aren't all gonna just wall off their new model.
A reason that Anthropic will eventually give is 'the competition can do what Glasswing can do so what's the point limiting it'.
I've spent probably over100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others that reached out to me) are not useless either. I use this myself regularly when choosing and comparing new models. I honestly beleive it is providing value to the conversation.
Let me know if you know of a better platform you can use to compare models, I built this one because I didn't find any with good enough UX.
https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...
https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...
Who knew Anthropic was this far behind???
For short-term bugfixing and tweaks though, it does about what I'd expect from Sonnet for a pretty low price.
https://github.com/Opencode-DCP/opencode-dynamic-context-pru...
Since the entire purpose, focus and motivation of this model seems to have been "coherency over longer contexts", doesn't that issue makes it not an OK model? It's bad at the thing it's supposed to be good at, no?
It does devolve into gibberish at long context (~120k+ tokens by my estimation but I haven't properly measured), but this is still by far the best bang-for-buck value model I have used for coding.
It's a fine model
as kimi did a huge amount of claude distilation it seems to be somewhat based in data
https://www.anthropic.com/news/detecting-and-preventing-dist...
I'm curious how the bang for buck ratio works in comparison. My initial tests for coding tasks have been positive and I can run it at home. Bigger models I assume are still better on harder tasks.
So I need them to not only not devolve into gibberish, but remain smart enough to be useful at contexts several times longer than that.
I suspect that this isn't the model, but something that z.ai is doing with hosting it. At launch I was related to find glm-5.1 was stable even as the context window filled all the way up (~200k). Where-as glm-5, while it could still talk and think, but had forgotten the finer points of tool use to the point where it was making grevious errors as it went (burning gobs of tokens to fix duplicate code problems).
However, real brutal changes happened sometimes in the last two or three months: the parent problem emerged and emerged hard, out of nowhere. Worse, for me, it seemed to be around 60k context windows, which was heinous: I was honestly a bit despondent that my z.ai subscription had become so effectively useless. That I could only work on small problems.
Thankfully the coherency barrier raised signficiantly around three weeks go. It now seems to lose its mind and emits chaotic non-sentance gibberish around 100k for me. GLM-5 was already getting pretty shaky at this point, so I feel like I at least have some kind of parity. But at least glm-5 was speaking & thinking with real sentances, I could keep conversing with it somewhat, where-as glm-5.1 seems to go from perfectly level headed working fine to all of a sudden just total breakdown, hard switch, at such a predictable context window size.
It seems so so probable to me that this isn't the model that's making this happen: it's the hosting. There's some KV cache issue, or they are trying to expand the context window in some way, or to switch from one serving pool of small context to a big context serving pool, or something infrastructure wise that falls flat and collapses. Seeing the window so clearly change from 200k to 60k to 100k is both hope, but also, misery.
I've been leaving some breadcrumbs on Bluesky as I go. It's been brutal to see. Especially having tasted a working glm-5.1. I don't super want to pay API rates to someone else, but I fully expect this situation to not reproduce on other hosting, and may well spend the money to try and see. https://bsky.app/profile/jauntywk.bsky.social/post/3mhxep7ek...
All such a shame because aside from totally going mad & speaking unpuncutaed gibberish, glm-5.1 is clearly very very good and I trust it enormously.
GLM5 also had this issue. When it was free on Openrouter / Kilo the model was rock solid though did degrade after 100k tokens gracefully. Same at launch with Zai aside from regular timeouts.
Somewhere around early-mid March zai did something significant to GLM5 - like KV quanting or model quanting or both.
After that it's been russian roulette. Sometimes it works flawlessly but very often (1/4 or 1/5 of the time) thinking tokens spill into main context and if you don't spot it happening it can do real damage - heavily corrupting files, deleting whole directories.
You can see the pain by visiting the zai discord - filled with reports of the issue yet radio silence by zai.
Tellingly despite being open source not a single provider will sell you access to this model at anything approaching the plans zai offers. The numbers just don't work so your choice is either pay per token significantly more and get reliability or put up with the bait and switch.
The bar is very low :(
But I used 70m tokens yesterday on glm-5.1 (thanks glm for having good observability of your token usage unlike openai, dunno about anthropic). And got incredible beautiful results that I super trust. It's done amazing work.
This limitation feels very shady and artificial to me, and i don't love this, but I also feel like I'm working somewhat effectively within the constraints. This does put a huge damper on people running more autonomous agentic systems, unless they have Pi or other systems that can more self adaptively improve the harness.
[[you guys, please don't post like this to HN - it will just irritate the community and get you flamed]]
Interesting.
Hopefully these aren't bots created by Z.AI because GLM doesn't need fake engagement.
Thanks for watching out for the quality of HN...
These are different from the submitter-passed-a-link-to-friends kind of upvoting and booster comments, which feel quaint by comparison. In this case people usually don't know they are breaking HN's rules, which is why they don't try to hide it.
There are YC members in the current batch who are spamming us right now [2]. They are all obvious engagement-bait questions which are conveniently answered with references to the SaaS.
[0]: https://www.reddit.com/r/DoneDirtCheap/comments/1n5gubz/get_...
[1]: https://www.reddit.com/r/AIJobs/comments/1oxjfjs/hiring_paid...
[2]: https://www.reddit.com/r/androiddev/comments/1sdyijs/no_code...
"I am the storm that is approaching, provoking..." : )