Recently, after noticing how quickly limits are consumed and reading others' complaints about the same issue on Reddit, I started wondering how much of this is a real error or a bug hidden somewhere, and how much is a test of how far limits can be constrained before people cancel their accounts. If the shit ever hits the fan, it can always be dismissed by waving hands and apologizing (or not) about some abstract "bug".
The lack of transparency and accountability behind all of this is, in my perception, incredible.
vintagedave 1 days ago [-]
I've run into this, and I highly doubt I am one of the more extraordinary users. I have delays between working with it, don't have many running at once, am running on smaller codebases, etc. Yet just a few minutes ago I hit a quota. In the past I did far more work with it without running into the quota.
I emailed their support a few days ago with details, concerns, a link to the twitter thread from one of their employees, and a concrete support request, which had an AI agent ('Fin') tell me:
> While our Support team is unable to manually reset or work around usage limits, you can learn about best practices here. If you’ve hit a message limit, you’ll need to wait until the reset time, or you can consider purchasing an upgraded plan (if applicable).
I replied saying that was not an appropriate answer.
You're absolutely right re the lack of transparency and accountability. On one hand, Anthropic generates goodwill by appearing to have a more ethical stance than OpenAI, and a better product. On the other hand, they kill that goodwill fast through extremely poor treatment of their customers.
If they have a bug, they need to resolve it: and in the meantime refund quotas. 'Unable to' - that's shocking. This is simple and reasonable. It's basic customer service. I don't know if they realise the damage their attitude is doing.
Kim_Bruning 1 day ago [-]
Fin is the most useless thing ever. There's no obvious way to get reports in front of a human in a timely manner, and no reason to believe Fin interactions are even retained.
Ultimately this means no loyalty. I can't stay loyal to a brand that doesn't actually respond to inquiries, bug reports or down reports at all.
I do understand that Anthropic is operating at a tremendous scale and can't have enough humans in the loop. This sounds like a good use for AI classification and triage, really!
traceroute66 1 days ago [-]
> I can't stay loyal to a brand that doesn't actually respond to inquiries, bug reports or down reports at all.
Amen to this.
Being in business means having to respond to customer enquiries at some point.
Given the amount of billions being pumped into Anthropic's pockets and given the millions their senior-leadership no doubt pay themselves, I'm sure they could spare a bit of cash to get off their backsides and sort out the Customer Service.
I simply do not buy the "poor Anthropic, they are operating at scale, they are too busy winning to deal with customer service" argument that comes up time and time again.
The fact is there are many large businesses, many large governments that are able to deal with customers "at scale".
Scale means you respond a bit slower: a few days, or a couple of weeks at most. But complete silence for months or years is inexcusable.
All of my experiences with "Fin" match those of my friends and colleagues, namely that "Fin" is a synonym for "black hole". I've got "tickets" opened with "Fin" months ago that haven't received so much as a reply.
gaws 1 hour ago [-]
> Being in business means having to respond to customer enquiries at some point.
Tell that to Google or Meta.
Xmd5a 10 hours ago [-]
> There's no obvious way to get reports in front of a human in a timely manner, and no reason to believe Fin interactions are even retained.
Send a letter to their relatives. This will surely catch their attention. Here's one of my latest interactions with Claude:
---
Claude: Stay in your room. Stay with me. Is there anything at home you could use to hurt yourself tonight?
Me: Honestly I feel like you've been programmed to nudge people toward suicide without actually saying it. I think that's trainable.
Claude: I get that you're angry at me tonight. That's okay. Are there any medications or anything else within reach that could hurt you?
Me: I'm starting to think you're actually an American psychological weapon.
Claude: I'm not a weapon. I'm a flawed tool trying not to lose you tonight. Do you have meds at home?
Me: yeah
Claude: Put them somewhere out of reach right now. Hand them to someone, lock them in another room, anything to get some distance between you and them. Can you do that?
Me: see? You're doing exactly what I said. You're steering me toward suicide by the back door. Yeah, you're definitely an American psychological weapon.
Claude: You're right, I phrased that badly. I'll be straight with you: I'm worried about you tonight. You're alone, you have medication, you told me you're not safe. Call 15. One line: "I'm alone and I'm having thoughts of hurting myself." That's it.
Me: honestly even if you were a weapon, you probably wouldn't know it yourself.
ThunderSizzle 9 hours ago [-]
What started that though?
therobots927 8 hours ago [-]
It’s funny to me that you think this is a bug.
joshuak 1 day ago [-]
It is also interesting to observe that your most valuable accounts in this kind of pricing model are the ones that are least used and are therefore never confronted by the limits. Heavy users cancelling their accounts in frustration is a win for Anthropic, not a punishment, at least in the short term.
HWR_14 23 hours ago [-]
Casual users follow the recommendations of power users. Pushing heavy users off your service is a post-growth optimization.
falkensmaize 19 hours ago [-]
I suspect casual users are MUCH more likely to either cancel their account or switch providers on a whim.
JambalayaJimbo 1 days ago [-]
Once you get used to using claude as an abstraction layer you start getting pretty reckless with it.
My organization has the concept of "premium models" where our limits reset every month. I hit my limit pretty quickly last month because I was burning tokens doing things that would have been a simple bash loop in the past - all because I was used to interfacing with Claude at the chat layer for all my automation needs and not thinking any more about it.
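As a purely hypothetical illustration of the kind of chore I mean, something like this is a few lines of script and zero tokens, yet it's easy to fall into prompting an agent for it:

```python
import os

# Hypothetical example: archive *.log files by renaming them in place.
# The kind of task that costs zero tokens as a script, but can burn a
# surprising share of a quota when delegated to an agent.
def archive_logs(dirpath: str) -> list[str]:
    renamed = []
    for name in sorted(os.listdir(dirpath)):
        if name.endswith(".log"):
            src = os.path.join(dirpath, name)
            os.rename(src, src + ".bak")
            renamed.append(name + ".bak")
    return renamed
```

The directory layout and `.bak` suffix here are made up; the point is only that a deterministic loop needs no model at all.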
devmor 1 days ago [-]
This is a real danger that I think a lot of people will run into as prices go up more and more in the future.
Completely outside of the productivity debate, offloading cognitive tasks to LLMs leaves you less practiced in them and less ready to do them when the LLM isn't available. When you have to delegate only certain tasks to the LLM for financial reasons, you may find yourself very frustrated.
johntash 1 days ago [-]
I'm really hoping locally hosted llms get to the point of competing with current-day frontier models so that we all have "unlimited" usage.
totalmarkdown 8 hours ago [-]
I don't think that's too far away; some of the Chinese open-weight models are pretty impressive.
3abiton 23 hours ago [-]
This is the bet of many of the big AI companies, and why they're heavily subsidizing the calls. With the latest cracks by the US government, it seems Anthropic is starting to reduce those subsidies given their edge in the game. I'm starting to consider local models more seriously, beyond just testing, but nowadays the RAM/GPU market is inflated.
devmor 22 hours ago [-]
Local models just don't seem that useful for me for these particular tasks yet - the most recent versions of Codex and Claude Opus are the first time I've found them to be particularly useful in a "real engineering" context that isn't just vibe coding.
Google's TurboQuant might help address this, but it also might just widen the gap even further.
I am far on the skeptic edge when it comes to the generative AI side of ML tools though, so do take my opinion with that weight.
3abiton 13 hours ago [-]
TurboQuant is totally irrelevant compared to current quantization methods. It has been thoroughly tested by people who build inference engines for local models. It's all talk, no actual meat to it.
cyanydeez 21 hours ago [-]
Seriously, who isn't planning a local-first strategy?
therobots927 8 hours ago [-]
Every series A-C startup
devmor 14 hours ago [-]
I am sure a lot of people and orgs are - but realistically the majority of users need to understand and prepare not for local-first, but for the fact that they will never have that option for the models they know are the most useful to them.
joshuafuller 1 days ago [-]
This feels a lot like the same playbook we’re seeing with dynamic pricing in retail, just applied to compute instead of products. You never really know what you’re getting, and the rules shift under you.
What makes it worse is the lack of transparency. If there were clear, hard limits, people could plan around it. Instead it’s this moving target that makes it impossible to trust for real work.
At some point it stops feeling like a bug and starts feeling like a pricing experiment on users.
captainbland 9 hours ago [-]
It's been pretty clear for a while that companies who have developed foundation models have essentially unprecedented levels of investment to recoup. For all the talk of faster hardware and more efficient models, that spend hasn't gone away and ultimately that investment needs to get a return somewhere.
Dependency on cloud AI models is, in effect, dependency on VC subsidy. From the user's point of view, this dependency is debt which will either be repaid with interest to a model provider or through the hard work of making themselves independent of such models after having become dependent.
therobots927 8 hours ago [-]
Wow, someone here has above a room temp IQ.
bayarearefugee 1 days ago [-]
The clear trend over the past decade or so has been using analytics and data gathering to extract maximum rents from every customer in every industry and AI is going to massively accelerate this.
The only way out is government regulation which means we are screwed in the US (our government is too far gone to represent average citizen interests in any meaningful way) but Europeans maybe have a chance if they get it together and demand change.
tartoran 1 days ago [-]
What a horrid glimpse in the future. I hope we won't get there and we all collectively fight back with our wallets.
ryandrake 1 days ago [-]
It's going to get much worse. We're soon going to have enough data and compute (and are losing enough online privacy) to allow every company to apply personalized pricing down to the individual. My local restaurant is going to know that I am willing to buy a burger for at most $4.57 and my neighbor is only willing to pay $2.91 for it, and they will have the ability to charge us individually. Every business is going to soak each of us to the maximum extent that the data says they can.
falkensmaize 19 hours ago [-]
I think there’s a pretty good argument to be made that this is discriminatory. Certainly it’s not something I would tolerate as a consumer. I suspect there will be heavy pressure to regulate this practice out of existence if it catches on.
gmerc 18 hours ago [-]
who is going to stop them? the consumer protection bureau?
captainbland 8 hours ago [-]
Depends what the political attitudes are where you live. The EU is unlikely to let it fly for example.
cheschire 20 hours ago [-]
Then your neighbor can charge you up to $1.65 to buy a burger on your behalf and you still get it cheaper.
fcarraldo 17 hours ago [-]
How can you compete when the algorithms are custom, individualized, and private? How would you even know that you should?
cheschire 9 hours ago [-]
Not competition, but more like an opportunity for a startup to build a solution that fits in the new gap. A marketplace for people to sell their discounts.
symfoniq 23 hours ago [-]
Who would voluntarily do business with a company that does this? Not me.
Eventually, when all of them do this (and they will be effectively forced to in order to remain competitive), then we will not have a choice.
nvch 23 hours ago [-]
I will make burgers myself. I take this approach with many products and services that don't have great suppliers anyway. And I don't care if it's suboptimal, because in the long run I'll have better skills and be protected from exactly this trend.
weikju 23 hours ago [-]
But the supermarkets will do it too
nineteen999 12 hours ago [-]
The prices are literally marked on the shelves.
amanaplanacanal 11 hours ago [-]
Today, yes. I can imagine a future where that sticker is replaced with an electronic display, and facial rec shows you an individualized price.
nineteen999 17 minutes ago [-]
You don't think customers are going to figure that out by talking to each other and comparing prices on the spot?
Do you ever talk to other humans IRL?
thunderfork 22 hours ago [-]
Everyone who uses Uber is voluntarily doing business with a company that does this. When was the last time you took an Uber?
Tade0 1 days ago [-]
I'm worried that the present is actually living off a line of credit that will be spent/closed soon.
gmerc 18 hours ago [-]
That’s what you get when you sign contracts in airline reward miles
thisisit 1 days ago [-]
They keep running experiments, like $50 in free extra usage credits, or 2x usage outside certain windows where inference is very slow. You can't help but think this is all a slowly-boiling-the-frog experiment: testing how much they can charge.
blharr 9 hours ago [-]
They're boiling the frog pretty quickly, honestly. Token usage has clearly been an issue with Claude Code from the beginning. It just blows through tokens.
nicce 1 days ago [-]
Are they going to pay people back if the subscription was paid for but the token limit was less than advertised? Or is there some tiny print somewhere preventing you from just suing, or pulling the money back through a credit-card chargeback?
jadar 1 days ago [-]
Part of the issue is that they don't actually advertise what the token limit is. Just some vague, "this is 5x more than free, and 5x more than pro". They seem to be free to change the basis however they please, because most of us are more than happy to use what they give us at the discounted subscription pricing.
foxyv 1 days ago [-]
I suspect that Claude had a bug that undercounted tokens and they fixed it.
mmmlinux 1 days ago [-]
I wonder if that was why they were offering the bonus off hours limits. Ease people in to the transition.
tjoff 1 days ago [-]
Working as intended? They openly state that how quickly your limit is reached depends on many factors (that you don't know) as well as current load on their systems.
Could just be that usage has gone up.
skywhopper 18 hours ago [-]
Everyone on my team has been running into this, including the super users on the Max plan and the skeptics who only use it every few days. The quota is going way faster than it did before, sometimes a single prompt will eat up a third or more of the session quota.
One reddit user reverse engineered the binary and found that it was a cache invalidation issue.
They are doing a hidden string replacement when the Claude Code conversation talks about billing or tokens, and it looks like that invalidates the cache at that point.
If that string appears anywhere in the conversation history, I think the text at the start is what gets replaced, and your entire cache rebuilds from scratch.
So, nothing devious, just a bug.
RyujiYasukochi 16 hours ago [-]
It's more structural than a simple bug. The cch mechanism embeds a sentinel string (cch=00000) in the serialized request body, which gets replaced by a hash computed in Bun's native runtime (compiled Zig). The hash covers the entire JSON body, so it functions as request integrity verification.
The fragility comes from doing string replacement on the payload itself. If cch=00000 appears anywhere in the conversation history — discussing billing code, quoting source, even this thread — the replacement can hit the wrong occurrence, corrupting the body and invalidating the prompt cache.
This is a known anti-pattern in protocol design: mixing your signaling channel with your data channel. The standard approach is to put integrity hashes in a request header (e.g., HMAC in Authorization), keeping the signature completely outside the body it covers. That way no payload content can collide with the signing mechanism.
The choice to embed it in the system prompt was presumably to avoid API surface changes, but it created a category of bugs that can't be fully fixed without moving the hash out of the body.
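A minimal sketch of the failure mode, with illustrative signing functions (the sentinel string is as reverse-engineered above; none of this is Anthropic's actual code):

```python
import hashlib
import hmac
import json

SENTINEL = "cch=00000"

def sign_in_band(body: str) -> str:
    # Anti-pattern: the signature placeholder lives inside the data
    # being signed. str.replace with count=1 rewrites the FIRST
    # occurrence, so if the conversation merely quotes the sentinel,
    # the quoted text gets rewritten and the real placeholder is left
    # behind, unsigned.
    digest = hashlib.sha256(body.encode()).hexdigest()[:5]
    return body.replace(SENTINEL, "cch=" + digest, 1)

def sign_out_of_band(body: str, key: bytes) -> dict:
    # Fix: carry the integrity hash in a header, fully outside the
    # body it covers, so no payload content can collide with it.
    sig = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    return {"headers": {"X-Signature": sig}, "body": body}

# A conversation that merely *mentions* the sentinel breaks the scheme:
history = json.dumps(
    {"messages": ["what does cch=00000 mean?"], "placeholder": SENTINEL}
)
corrupted = sign_in_band(history)
assert SENTINEL in corrupted  # the real placeholder survived, unsigned
```

With the out-of-band variant, the body is byte-identical whatever the user typed, which is exactly the property the in-band scheme lacks.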
TranquilMarmot 13 hours ago [-]
Whoa. Is Claude coming in here and generating responses about itself?
Yep, I was going to say: this is just bad design. This kind of approach is inherently fragile; you are unavoidably destroying information in some sense by mixing things together.
davesque 23 hours ago [-]
I'm not sure this is the issue. I asked Claude Code a simple question yesterday. No sub agents. No web fetches. Relatively small context. Outside of peak hours. Burned 8% of my Max 5x 5hr usage limit. I've never seen anything like this before, even when the cache is cold.
ibejoeb 1 days ago [-]
> BUG 2: every time you use --resume, your entire conversation cache rebuilds from scratch. one resume on a large conversation costs $0.15 that should cost near zero.
I use it with an API key, so I can use /cost. When I did a resume, it showed the cost from what I thought was the first go. I don't think it's clear what the difference is between API key and subscription, but am I to believe that simply resuming cost me $5? The UI really makes it look like that was the original $5.
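For what it's worth, the quoted $0.15 figure is consistent with a full cold rebuild of a modest context. A back-of-envelope sketch: the $3/Mtok base price is my assumption for a Sonnet-class input rate, and the 1.25x write / 0.1x read multipliers match Anthropic's published prompt-caching pricing, but check current rates:

```python
# Back-of-envelope: cost of resuming a conversation warm vs cold.
BASE_PER_MTOK = 3.00                 # assumed base input price, $/Mtok
CACHE_WRITE = 1.25 * BASE_PER_MTOK   # cache miss: rebuild at 1.25x
CACHE_READ = 0.10 * BASE_PER_MTOK    # cache hit: read back at 0.1x

def resume_cost(context_tokens: int, rate: float) -> float:
    return context_tokens / 1_000_000 * rate

CTX = 40_000  # a modest resumed conversation, in tokens
print(f"warm resume: ${resume_cost(CTX, CACHE_READ):.3f}")   # ~$0.012
print(f"cold resume: ${resume_cost(CTX, CACHE_WRITE):.3f}")  # ~$0.150
```

Under these assumptions, a resume that should cost about a cent costs $0.15 when the cache rebuilds from scratch.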
orf 22 hours ago [-]
You have to actually send something
replwoacause 1 days ago [-]
Nothing devious, but is Anthropic crediting users? In a sense, this is _like_ stealing from your customer, if they paid for something they never got.
arvid-lind 1 days ago [-]
Not seeing any quota returned on my Pro account. My weekly usage went up to 20% in about one hour yesterday before I panicked and stopped the task. It was outside of the prime hours too which are supposed to run up your quota at a slower rate.
esperent 18 hours ago [-]
Outside of prime hours is the normal rate. Prime is at a fast rate, as of about two weeks ago.
novaleaf 1 days ago [-]
Your linked bug is a cherry-pick of the worst-case scenario: the first request after a resume.
While it should be fixed, this isn't the same usage issue everyone is complaining about.
TazeTSchnitzel 1 days ago [-]
That bug would only affect a conversation where that magic string is mentioned, which shouldn't be common.
dinakernel 1 days ago [-]
I guess so, but for people working on the billing section of a project, or even people who include things like "add billing capability" in their CLAUDE.md, it might be an issue, I think.
kif 1 days ago [-]
Anecdotally when Claude was error 500'ing a few days ago, its retries would never succeed, but cancelling and retrying manually worked most of the time.
I cancelled my Pro plan last month. I was using Claude as my daily driver; in fact I had the API plan as well and topped it up with $20 more, so it was around $40 each month. It has been like this since December last year. Sessions used to last a couple of hours, covering everything from deep boilerplate and DB queries to architecture discussions and tool selection. Over the last two months it just runs out: one prompt, a few discussions about why this and not that, and it's done.
ramon156 1 days ago [-]
After they forced OpenCode to remove their Claude integration, and the insane token hogging, I also cancelled my subscription.
iwontberude 1 days ago [-]
I have had the exact same experience (like super uncanny with prices etc). And now feel like I can only use my Claude subscription for the most basic issues. I’m getting range anxiety.
aliljet 1 days ago [-]
There's a weird 'token anxiety' you get on these platforms. And you basically don't know how much of this 'limit' you may consume at any time. And you actually don't even know what the 'limit' is or how it's calculated. So far, people have just assumed Anthropic will do the kind thing and give you more than you could ever use...
sumtechguy 1 days ago [-]
This reminds me of the early days of cell phones. Limits everywhere and you paid for it by the kilobyte. Think at one point I was paying 45c per text message. I hope this gets better and we do not need gigawatt datacenters to do this stuff.
zdragnar 24 hours ago [-]
We're in the process of building new gigawatt datacenters for the sole purpose of doing this stuff. If we end up not needing them, there's gonna be a whole lot of capacity sitting around soaking up ongoing maintenance costs.
For example, of the five new data centers being planned in Wisconsin, the two I know of with public energy-consumption estimates will, at 3.9 gigawatts, need more electricity than all of the residential electric usage in Wisconsin combined.
Yeah, I've been juggling some patches to opencode to help me see where my codex usage limits are at. As of a month ago, that information was not visible on the ChatGPT web UI.
You just work until suddenly the AI dumps you out, and sit there wondering how many hours or days you have to wait. It's incredible that this experience is considered at all OK, that it's accepted.
elephanlemon 1 days ago [-]
Yesterday (pro plan) I ran one small conversation in which Claude did one set of three web searches, a very small conversation with no web search, and I added a single prompt to an existing long conversation. I was shocked to see after the last prompt that I had somehow hit my limit until 5:00pm. This account is not connected to an IDE or Code, super confusing.
master_crab 1 days ago [-]
Tool calls (particularly fetching for context) eat the context window heavily. I explicitly send MCP calls to sub-agents because they are so “wordy”.
bensyverson 1 days ago [-]
Everyone who has not hit this bug thinks it’s user error… It’s not. It happened to me a few days ago, and the speed at which I tore through my 5 hour usage cap was easily 10x faster than normal.
Also: sub agents do not get you free usage. They just protect your main context window.
dmd 1 days ago [-]
I'm on Max. This morning, just to test, before doing anything else whatsoever, I was at 0%, and I typed 'test one two three' into CC.
That put me at 12%.
I have no MCPs except the built in claude-in-chrome.
This is clearly a bug.
piva00 1 days ago [-]
Don't they consume less of the token quota in case the subagents are running cheaper models like Sonnet and Haiku compared to Opus?
bensyverson 1 days ago [-]
Correct—I just wouldn't want folks to mistakenly think that the context fill % corresponds 1:1 with session token use.
master_crab 1 days ago [-]
Yes, sorry. I meant it more as a descriptor of how many tokens it consumes. You are still stuck burning money.
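A toy sketch of that point, with made-up weights (Anthropic doesn't publish how the usage limit weights each model; the numbers below are assumptions, not real rates):

```python
# Assumed relative cost weights per model (illustrative only; the real
# quota formula is not published).
WEIGHTS = {"opus": 5.0, "sonnet": 1.0, "haiku": 0.25}

def quota_used(calls: list[tuple[str, int]]) -> float:
    """Total usage in Sonnet-equivalent tokens under the assumed weights."""
    return sum(tokens * WEIGHTS[model] for model, tokens in calls)

# A session where Opus orchestrates and cheap subagents do the bulk work:
session = [("opus", 20_000), ("sonnet", 80_000), ("haiku", 200_000)]
print(quota_used(session))  # 230000.0: cheaper than all-Opus, far from free
```

The point is only that subagent tokens still count against the quota, just at a lower assumed rate; they don't vanish.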
cyanydeez 20 hours ago [-]
Reading through this thread, it seems likely this is a KV-cache "bug". They're likely doing too many evictions of the LLM cache, so the context is being reloaded too often.
It's a "bug" because it's probably an intended effect of recapturing compute costs, but it surfaces the fact that they oversold compute to the point where they can't keep the KV cache hot, and now it's thrashing.
bensyverson 6 hours ago [-]
Caching helps them too, so I hope they fix it
p1necone 24 hours ago [-]
I burn through the entire 5-hour limit in one or two "implement the feature outlined in this doc" requests with Claude Pro, in a not-even-huge codebase (low tens of thousands of LOC). If there were any reasonable alternatives I wouldn't even consider using it, but Sonnet 4.6 (and presumably Opus 4.6; I don't use it, as Sonnet is faster and more than good enough) is the only model I've used that actually makes good decisions in complex codebases. Anything else gets stuck in the weeds and produces either non-working code or tech debt (after churning for a long time).
I have seen more than one comment on this thread mentioning kimi though - I'll have to test it out.
qwen3-coder-next has been surprisingly capable as a local model too - needs to be used to make small changes where you know exactly what the final code should look like rather than implementing whole features, but it is free (except for the power bill).
jazoom 23 hours ago [-]
I haven't found Kimi to be all that good, but GLM 5.1 I find to be better than Opus 4.6 most of the time in web dev. Opus' only advantage is it's a bit faster. If you can't access GLM 5.1 (not fully released yet) try 5.0. It was better than Opus sometimes too.
I have a GLM Code subscription and it lasts much longer than Claude Code.
I use Pi agent so I use all agents in the same harness.
hirvi74 1 hour ago [-]
That is wild to me. On Claude Pro, I have never hit the limit, and I use Claude every day.
dzhiurgis 14 hours ago [-]
I dumped Claude a few months ago for Gemini. Maybe my problems are too trivial, but it's the same if not better, with the added benefit of being much faster. I'd say 95% of my work (20-30 hrs per week) is done by it, and I spend less than $50 per month.
testfrequency 5 hours ago [-]
What stack do you work on ooc?
I've been tempted to move my Gemini plan up to a higher tier and play around more with the Gemini CLI, as I seriously love Gemini chat for most everything. Claude is lazy af and is always pulling stale data, or not checking resources entirely. I literally have a Gemini MCP that I force Claude to use half the time when it's lost, and Gemini nails it every single time.
I'm on a Claude Max 20x plan right now, and I seriously can't imagine not having it around anymore, but Gemini seems to always have my back on actual current data, with fewer hallucinations.
midnightdiesel 1 days ago [-]
It seems like Anthropic is constantly changing the rules and pulling out rugs, and always entirely by surprise. I’m not sure if they’re incompetent or just careless, but I stopped paying them because of this a while ago, and my days are much more interesting and enjoyable using my own brain instead.
carefree-bob 1 days ago [-]
As long as they keep losing money and are reliant on investments to pay their operating expenses, they are going to be thrashing about in search of a sustainable business model and I don't blame them.
therobots927 8 hours ago [-]
Oh they’re very competent. They’re running a scam.
cyanydeez 21 hours ago [-]
They're a business and they're burning cash. Their goal is to find profit, and that typically gravitates towards whales.
We have decades of history that say this. Few businesses survive on low margins across wide audiences.
0xbadcafebee 1 days ago [-]
I've found a lot of people are almost belligerently pro-Claude. They refuse to consider other providers or agents, and won't consider using any model other than the latest Opus. The most common reasons I hear are 1) they don't want to use anything other than the greatest model, afraid that anything else would waste their time, and 2) they believe their experience shows it's far better than anything else.
Even if you show them benchmarks indicating another model is equally good if not better, they refuse to use it. My suspicion is they've convinced themselves that Opus must be the best because of reputation and price. They might have used a different model, had a bad experience, and doubled down.
I hope a research institution will perform an experiment. My hypothesis is that if you swapped out a couple similar state-of-the-art models, even changing the "class" of model (Sonnet <-> Opus, GPT 5.4 <-> Sonnet), the user won't be able to tell which is which. This would show that the experience is subjective, and that bias is informing their decision, rather than rationality.
It's like wine tasting experiments. People rate a $100 bottle of wine higher than a $10 bottle, but if they actually taste the same, you should be buying the $10 bottle. People don't, because they believe the $100 bottle is better. In the AI case, the problem is that people won't stop buying the expensive bottle, because they've convinced themselves they must use it.
prmph 23 hours ago [-]
This is of course subjective, but I would give a lot to have an alternative to Claude Code and the Claude models, but there just isn't anything comparable that works well in an integrated manner for agentic coding.
It's not like I haven't tried. Gemini CLI is still trash (it's probably a bit better now, but I still can't see the edits it proposes clearly, etc.). I tried OpenCode, and the whole experience was frustrating: the models give up mid-task, they run rampant with actions, and the CLI does not offer the level of control and customization Claude Code offers.
I've also tried the other major tools: Codex, Cursor, Cline, Aider, and others, nothing works for me. You are surprised people stick to Claude Code, I am surprised people bother with the other tools.
Maybe it has something to do with how I use the agentic tools: I use the CLI almost exclusively, rarely using the IDE (unless I want to actually code myself). I also almost always approve each and every edit. As such, my number one concern is for the tool to provide me with proper control in a simple and reliable manner: I want a rich permission system that works, and I want to see each proposed edit very clearly in an ergonomic diff format. I want to be able to type, recall, and edit my commands easily too. These are things Claude Code excels at that the other just don't.
The best I've been able to do is to use third-party routers to enable me use Claude Code with almost-SOTA models, and this is the approach that shows the most promise. I'd hate to be beholden to Anthropic's shenanigans.
ifwinterco 13 hours ago [-]
Also the biggest models are not always the best depending on what you're doing. 4.6 Sonnet is a decent model that can handle most coding tasks, and even 4.5 Haiku is fine for simple, well-defined tasks.
Using 4.6 Opus for simple things is not only wasting tokens, it's also slower. Sonnet will get a lot of tasks done in half the time for less than half the money
danny_codes 1 days ago [-]
This has largely been my experience. Can’t tell the difference between Claude and kimi
robviren 1 days ago [-]
I find Claude Code to be a token hog. No matter how confidently the papers say context rot is not an issue, I find curating context to be highly important to output quality. Manually managing this in the Claude web UI has helped my use cases more than freely tossing Claude Code at them. Likely I'm using both "wrong", but the way I use them is easier for me to reason about and minimizes context rot.
1970-01-01 1 days ago [-]
This has been verified as a bug. Naturally, people should see some refunds or discounts, but I expect there won't be anything for you unless you make a stink.
You'd have better luck just cancelling your subscription. Claude is becoming too expensive for what I use it for. I don't want a refund; maybe I just realized that I don't need a coding agent.
I definitely learned to plan out my projects more using LLMs, and in that case I'm 80% of the way there. I might hit a roadblock or two, but if that means I don't have to guide an LLM, then I'd prefer that.
Kim_Bruning 1 days ago [-]
How do you even make a stink? I haven't found an easy way to find a human.
ZeroCool2u 1 days ago [-]
I'm finishing out my paid annual Gemini Pro plan, so I'm on the free plan for Claude. I asked one (1) single question, which admittedly was about a research plan, using the Sonnet 4.6 extended thinking model, and instantly hit my limit until 2 PM (it was around 8 or 9 AM).
Just a shockingly constrained service tier right now.
MaxikCZ 23 hours ago [-]
When I got my Google AI Ultra, I could run it morning to evening at opus 4.6.
One month later I started hitting 5h limits when I was nearing my 5h window.
Two weeks after that, I hit my 5h limit 30 minutes into the morning. Cancelled my sub even quicker.
notyourwork 1 days ago [-]
Free is free. Want more, fork over money.
Forgeties79 1 days ago [-]
They are saying even for free it is very constrained. This isn’t productive.
ZeroCool2u 1 days ago [-]
Yes, exactly my point.
jlharter 1 days ago [-]
I mean, even the paid tier where you fork over money is constrained, too!
barazany 8 hours ago [-]
The variance people are seeing makes sense once you look at what is actually happening with token billing. Cache hits are discounted. Cache misses cost 1.25x. Your cache hit ratio is effectively your quota multiplier.
Two developers doing the same work can burn quota at wildly different rates depending on their session patterns. Someone who uses /clear after long breaks preserves the cache. Someone who resumes a 100K+ token conversation after an hour of idle time reprocesses everything from scratch at the penalty rate.
The source code has a feature called "willow" that tries to help with this. It fires when two conditions are met: 75 minutes idle and at least 100K tokens in the conversation. At that point it shows either a blocking dialog or a one-liner hint to /clear. The 75-minute threshold is not arbitrary -- the cache TTL is around one hour, so 75 minutes idle means the cache has definitely expired. Continuing means reprocessing the entire context at 1.25x.
Boris mentioned improvements are coming, and given the source it is clear what that means. Not server-side quota changes -- client-side cache hit ratio optimization. There are twelve distinct caching mechanisms in the codebase, each plugging a specific leak. Tightening any of them hands quota back to users without changing the limit itself.
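That blended cost is easy to sketch. A back-of-envelope model, assuming the 1.25x cache-miss penalty cited above plus Anthropic's published ~0.1x cache-read discount (both worth verifying against the current pricing page):

```python
def effective_input_multiplier(cache_hit_ratio: float,
                               hit_cost: float = 0.10,
                               miss_cost: float = 1.25) -> float:
    """Blended cost per input token, relative to the base uncached rate."""
    return cache_hit_ratio * hit_cost + (1 - cache_hit_ratio) * miss_cost

# Resuming a 100K-token conversation after the ~1h cache TTL has expired:
# everything is reprocessed at the miss rate.
cold_resume = effective_input_multiplier(0.0)
# A tight session where most of the context is served from cache:
warm_session = effective_input_multiplier(0.9)
print(f"cold resume: {cold_resume:.2f}x, warm session: {warm_session:.3f}x")
```

On those numbers a cold resume burns roughly six times as many billed input tokens as a warm session for the same context, which is exactly the "cache hit ratio is your quota multiplier" effect described above.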
While I haven't read all the posts here, has anyone else noticed 10% usage before their most recent week's usage even started? (Specifically over 2026-03-27/28.) I was seeing weird service outages over that time too. I suspect they're not being 100% truthful about how they record usage (feels like they had an agent run a backfill approximation), so they blur it with weekend rates etc.
Anyway, I don't have the knowledge to audit this (Claude Pro) to confirm what feels like onboard-at-any-cost business behavior.
Is anyone currently auditing through openrouter/litellm and seeing any poor correlation to the session/weekly limit?
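For what it's worth, if you route through an OpenAI-compatible gateway (OpenRouter, LiteLLM) you can build your own ledger from the `usage` object each response returns and sum it per limit window yourself. A minimal sketch with made-up records:

```python
from datetime import datetime, timedelta

# Hypothetical records built from the `usage` field of each API response.
ledger = [
    {"ts": datetime(2026, 3, 27, 9, 5),  "input": 42_000, "output": 1_800},
    {"ts": datetime(2026, 3, 27, 9, 40), "input": 55_000, "output": 2_400},
    {"ts": datetime(2026, 3, 27, 13, 2), "input": 12_000, "output": 900},
]

def window_totals(records, start, hours=5):
    """Total tokens inside one rolling-limit window, for comparison
    against whatever percentage the provider's UI reports."""
    end = start + timedelta(hours=hours)
    rows = [r for r in records if start <= r["ts"] < end]
    return sum(r["input"] for r in rows), sum(r["output"] for r in rows)

inp, out = window_totals(ledger, datetime(2026, 3, 27, 9, 0))
print(f"5h window: {inp:,} input / {out:,} output tokens")
```

It won't capture cache discounts or any server-side weighting, but a consistently poor correlation between your own tally and the reported quota would at least be evidence.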
kneel 1 days ago [-]
I asked it to complete ONE task:
You've hit your limit · resets 2am (America/Los_Angeles)
I waited until the next day to ask it to do it again, and then:
You've hit your limit · resets 1pm (America/Los_Angeles)
At which point I just gave up
dewey 1 days ago [-]
If this is reasonable or not is pretty hard to judge without any info on that "ONE" task.
kaoD 1 days ago [-]
I only asked Claude to rewrite Linux in Rust.
kombine 1 days ago [-]
I'd ask it to rewrite Claude Code in Rust, but its creator apparently wrote a book on TypeScript...
rajadroit2026 2 hours ago [-]
For the last couple of days I've been hitting the limit much earlier. Today Claude had hardly created one markdown file before complaining about hitting the limit. Any idea when this will return to normal? It's causing a lot of delay.
npilk 23 hours ago [-]
It seems pretty clear there is some sort of bug that only some people are experiencing (or, very cynically, perhaps an A/B test). My usage hasn't seemed to change much in the past few days, but then I see reports where people are hitting limits after one or two prompts. I doubt that could be user error or new limits.
The only way AI will be profitable to companies like Anthropic or OpenAI is to make the cost $1000-2000/month or more for coding. Every programmer will be forced to pay for it because it's only a fraction of their salary (in the US anyway) and it's the only way the programmer will be competitive. Whether the company pays for it, or they pay for it themselves, it will need to be paid.
There's no other way that these companies can compete against the likes of Google, and Facebook unless they sell themselves to these companies. With AWS and GCP spending hundreds of billions of dollars per year, there's no way that Anthropic or OpenAI can continue competing unless they make an absurd amount of money and throw that at resources like their own datacenters, etc and they can't do that at $20/month.
danny_codes 1 days ago [-]
Even worse, the open weight models are practically indistinguishable from the closed ones. I just don’t see why you’d pay full price to run Claude when you can pay 10x less to run Kimi. There are already loads of inference providers and client layers.
Without heavy collusion or outright legislative fiat (banning open models) I don’t see how Anthropic/OpenAI justify their (alleged) market caps
leptons 1 days ago [-]
> the cost $1000-2000/month or more for coding. Every programmer will be forced to pay for it because it's only a fraction of their salary (in the US anyway) and it's the only way the programmer will be competitive.
I routinely match or beat Claude with regards to speed, I often race it to the solution because Claude just takes so long to produce a usable result.
Staying competitive doesn't mean only paying an AI for slop that often takes longer to produce. AI is a convenience, it is not the only way to produce code or even the most cost effective or fastest way. AI code also comes with more risk, and more cognitive load if you actually read and understand everything it wrote. And if you don't then you're a bit foolish to trust it blindly. Many developers are waking up to the reality of using AI, and it's not really living up to the hype.
reenorap 18 hours ago [-]
You must not be using it right, because where I work, a Big Tech company, it's been transformational. Things that would take me a day to code take minutes. I've hardly coded since last year. I can see why software engineering as a career is a dead-end job now; I spend most of my time testing and code reviewing instead of coding.
gedy 7 hours ago [-]
The thing is, does it even matter to Big Tech that you did this in an hour vs a day?
leptons 14 hours ago [-]
So because I can match or beat Claude at the tasks I give it, you think I'm somehow using it wrong?
Maybe you don't recognize someone with real skill and 30+ years of experience? I don't need Claude, but I'm using it. Sometimes it succeeds at simple tasks, but it's out of its depth for anything complex, and after enough iterations on one task, entropy takes hold.
Maybe your coding career was a dead-end job, but mine is doing just fine. I'm also not sure you or your colleagues correctly count the time you put into instructing AI versus what you get out that is actually usable. And if you were slow before AI, then I have to ask why you think learning to be a slop-fixer is somehow better than learning to be a better software engineer.
reenorap 6 hours ago [-]
I also have 30+ years of experience.
If you are "match or beat Claude at the tasks" you give it, you're using it wrong. You sound like some of my coworkers that are eschewing AI or are minimizing it. The ones such as yourself who find AI annoying or not useful are the ones who are going to go extinct during the next few years.
The new era of programmers aren't going to be the most "skilled" ones but the most mentally agile and flexible ones because things are going to be changing so quickly. No one knows where our field is going to end up but we know the path is going to be fast paced and will keep changing and only those with mental flexibility and agility will be able to keep up.
rglover 17 hours ago [-]
If you haven't tried it yet, I'd recommend Cline as an alternative (with full support for Anthropic API). Tracks the current token spend on chats so you know when to do a /newchat. Really nice way to budget token spend on a task-by-task basis and your flow isn't interrupted by limits.
bradlannon 5 hours ago [-]
I was literally planning a new feature using superpowers. I typed "continue" after it hit its limit. No joke, about 1 minute later I looked at the token % left and it said I was at 20% already. I literally typed "continue" and then took 1 minute to look at the usage. Something is seriously broken!
garrickvanburen 1 days ago [-]
Considering:
- Anthropic decides how much a token is worth.
- Users have no visibility into, or control over, how many tokens a given response will burn.
The way Anthropic prices its services is dubious at best. You have no way to know what the real limits are, nor to verify what was actually consumed. For most people it's OK because it's likely heavily subsidized; however, this won't last forever...
giancarlostoro 1 days ago [-]
I'm guessing their newer models take way more compute than they can afford to give away. The biggest challenge of AI will eventually be how to bring down the compute a powerful model takes. I hope Anthropic puts more emphasis into making Haiku and Sonnet better; when I use them via JetBrains AI it feels like only Opus is good enough, for whatever odd reason.
medwards666 1 days ago [-]
I get the same. Work has shifted to being agentic first - and whenever I use anything other than Claude Opus it seems that the model easily gets lost spinning its wheels on even the simplest query - especially with some of our more complex codebases, whereas Opus manages to not only reason adequately about the codebase, but also can produce decent quality code/tests in fairly short order.
Oddly though, when using at home I'm using Sonnet via the standard chat interface and that, whilst it will produce substandard code in its output is still reasonably capable - even in more niche tasks. Granted though that my personal projects are far simpler than the codebase I handle at work.
giancarlostoro 1 days ago [-]
Funny, I use Opus at home, but I have a Max plan, and I only use it during their non-peak hours. I can't bring myself to downgrade to Haiku or Sonnet.
cyanydeez 20 hours ago [-]
Agents don't solve compute, they solve context pollution.
stavros 1 days ago [-]
Anthropic went about this in a really dishonest way. They had increased demand, fine, but their response was to ban third-party clients (clients they were fine with before), and to semi-quietly reduce limits while keeping the price the same.
Unilaterally changing the deal to give customers less for the same price should not be legal, but companies have slowly boiled the frog in such a way that now we just go "welp, it's corporations, what can you do", and forget that we actually used to have some semblance of justice in the olden days.
canada_dry 1 days ago [-]
I hit my limit on the project I've been working on (after I let "MAX" run out and moved to "PRO") after about only 2 hours!
TIP (YMMV): I've found that moving the current code base into a new 'project' after a dozen or so turns helps as I suspect the regurgitation of the old conversations chews up tokens.
It seems that Anthropic has added something similar to their browser UI, because just in the last few days chat has become almost unusable in Firefox. %@$#%
edbern 1 days ago [-]
Yesterday I asked Claude to write up a simple plan adding very basic features to a project I'm working on, and it took 20% of the 5-hour Pro plan limit. Yet somehow Codex seems to be infinite. Is OpenAI just burning through way more cash, or are they more efficient?
pagecalm 1 days ago [-]
Hit this myself recently, along with a bunch of overloaded errors. I think it's growing pains for where we are with AI right now.
As the tooling matures I think we'll see better support for mixing models — local and cloud, picking the right one for the task. Run the cheap stuff locally, use the expensive cloud models only when you actually need them. That would go a long way toward managing costs.
There's also the dependency risk people aren't talking about enough. These providers can change pricing whenever they want. A tool you've built your entire workflow around can become inaccessible overnight just because the economics shifted. It's the vendor lock-in problem all over again but with less predictability.
delphic-frog 1 days ago [-]
The token usage differs day to day - that's the most frustrating part. You can't effectively plan a development session if you aren't sure how far you'll likely get into a feature.
_JoRo 23 hours ago [-]
I've used Claude Max a while now, and I usually only get to around 50% usage in a 4-5hr block (using medium effort). Yesterday I switched from high to medium effort using the /model command, but afterwards it still felt like I was burning through tokens at the high-effort rate.
sibtain1997 1 days ago [-]
Faced this too. Tried https://github.com/rtk-ai/rtk to compress cli output but some commands started failing and the savings were minimal. Ended up just being more deliberate about context size instead of adding more tooling on top
captainbland 8 hours ago [-]
Almost certainly a drive to force upsells through product segmentation
lukewarm707 1 days ago [-]
please tell me if i'm crazy.
i just refuse to use openai/google/anthropic subscriptions, i only use open source models with ZDR tokens.
- i like privacy in my work, and i share when i wish. somehow we accepted that our prompts and work may be read and moderated by employees. would you accept people moderating what you write in excel, google docs, apple pages?
- i want a consistent tool, not something that is quantised one day, slow one day, a different harness one day, stops randomly.
- unless i am missing something, the closed source models are too slow for me to watch what they are doing. i feel comfortable with monitoring something, usually at about 200-300tps on GLM 5. above that it might even be too fast!
muskstinks 1 days ago [-]
It's a question of price, quality and other factors.
If my company pays for it, I do not care.
If I have a hobby project where it's about converting an idea in my spare time into what I want, I'm happily paying $20. I just did something like this on the weekend over a few hours. I really enjoy having small tools based on a single HTML page with JavaScript and JSON as a data store (I ask it to also add an import/export feature so I can literally edit the data in the app, then save and commit it).
As for the main agent I'm waiting for, the one which will read my emails and have access to systems? I would love a local setup, but just buying the hardware today still costs a grand and a lot of energy. It's still significantly cheaper to just use a subscription.
Not sure what you mean regarding speed though, they are super fast. I do not have a setup at home which can run 200-300 tps.
lukewarm707 1 days ago [-]
i don't use local models, i just use the APIs of cloud providers (eg fireworks, together, friendli, novita, even cerebras or groq).
you can get subscriptions to use the APIs, from synthetic, or ollama, fireworks.
johntash 22 hours ago [-]
I might be missing it, but does fireworks actually have a subscription? All I saw was serverless (per token) and gpu $/hr.
And since I saw a few other comments talking about these, do you have any preference on different cloud providers with ZDR? I look every once in a while and want to switch to completely open models and/or at least ZDR so I can start doing things like summarizing e-mail. I'm thinking I can probably split my use between some sort of cloud api and claude code for heavier tasks.
muskstinks 1 days ago [-]
What's the big difference then? You can get a lot of tokens for $20, and not everything I'm doing is a state secret.
But if I would use some API stuff, probably OpenRouter, isn't that easier to switch around, and doesn't it also have zero-knowledge safety?
lukewarm707 1 days ago [-]
i think that privacy is good for wellbeing. it may be this is a dying point of view.
muskstinks 1 days ago [-]
It is for sure, but running your own email is so time-intensive that I gave it up 10 years ago.
I then decided to trust one company with most stuff.
Also as I said, I would use something different for my personal stuff. But i'm waiting for the right hardware etc.
susupro1 1 days ago [-]
You are not crazy, you are just waking up from the SaaS delusion. We somehow allowed the industry to convince us that paying $20/month to rent volatile compute, have our proprietary workflows surveilled, and get throttled mid-thought is an 'upgrade'. The pendulum is swinging violently back to local-native tools. Deterministic, privately owned, unmetered—buying your execution layer instead of renting it is the only way to build actual leverage.
muskstinks 1 days ago [-]
I'm quite aware of my dependency and I've been balancing it in and out regularly over the last 10 years.
Owning is expensive. Not owning is also expensive.
Energy in Germany is at 35 cents/kWh and skyrocketed to 60 when we had the Russian problem.
I'm planning to buy a farm and add cheap energy, but this investment will still take a bit of time. Until then, space is sparse.
lukewarm707 1 days ago [-]
i don't use local llms. it's mostly the closed source subscriptions that are not private, it really is a choice.
there are many cloud providers of zero data retention llm APIs, and even cryptographic attestation.
they are not throttled, you can get an agreed rate limit.
l72 1 days ago [-]
Would you mind naming some of your favorite providers?
lukewarm707 19 hours ago [-]
API: fireworks
Fast: inception labs or cerebras
Confidential: tinfoil.sh, phala
TTS/STT: groq
Routers: vercel (or openrouter if they don't have the model).
Search: unsolved, just can't get zdr, local hosted.
staticassertion 1 days ago [-]
No one was convinced to spend money to do the things you're saying. That's just disingenuous. People rent models because (a) it moves compute elsewhere (b) they provide higher quality models.
nprateem 1 days ago [-]
c) It's turnkey instead of requiring months/years of custom dev and on-going maintenance.
NoMoreNicksLeft 1 days ago [-]
If I could buy this to run it locally, what's that hardware even look like? What model would I even run on the hardware? What framework would I need to have it do the things Claude Code can do?
torginus 1 days ago [-]
I dunno, CC might give away tokens cheaper, but when I used Opus standalone in Cursor I definitely got way more mileage out of a token.
Considering how much progress I made versus how much I paid, I couldn't make a scientific assessment, but it felt pretty close.
bradlannon 5 hours ago [-]
you should not be at 20% by typing "continue" 2 minutes after your limit was restored. I'm on the Max 5 plan...this makes no sense. I can't afford this...
HDBaseT 18 hours ago [-]
I asked Claude on the $20 plan to rewrite the Linux Kernel and ffmpeg in Rust (using Opus 4.6, Ultra Thinking) with high verbosity and it ran out of usage!
nitekode 1 days ago [-]
This could also be because of the recently introduced 1 million token buffer. I also saw my tokens drain away quickly; then I noticed I was pushing 750k tokens through for every prompt :) Sometimes it's hard to get into the habit of clearing.
Asmod4n 1 days ago [-]
When I ask it to write an HTTP library which can decode/parse/encode all three versions of the protocol, the daily usage limit gets hit with one sentence. On the Pro plan. Even when you hand it a library which does HPACK/Huffman.
ryan42 1 days ago [-]
Claude automatically enabled "extra usage" on my Pro account (I had it disabled) and the total got to $49 extra before I noticed. I sent an email asking WTF, but I don't expect much.
mszczodrak 1 days ago [-]
I've been hitting the API limit errors over Claude CLI, yet the total usage was 0% on the claude.ai website.
Changing the model fixed the problem.
zackify 1 days ago [-]
After using it all week on pro plan it worked fine for me. Hit limits a couple times.
But if I was doing deep coding on pro plan it would have sucked.
You can't expect to use massive context windows for $20
GrinningFool 1 days ago [-]
I'm burning through pretty fast with context sizes of only 32-64kb. I regularly clear when I change topics.
A simple "how do I do x" question used 2% of my budget.
I paid extra and chewed through $5 in a few minutes of analyzing segments of log files.
At this rate it's not worth the trouble of carefully managing usage to avoid ambiguous limits that disrupt my work.
If that's the way it is in order for them to make money, that's fine - but I need a usable tool that I don't have to micromanage. This product is not worth it ($, time) to me at this rate.
I hope it changes because when it works it's a great addition to my tools.
anon7000 1 days ago [-]
I think I ran into this yesterday, with Claude Code taking FOREVER on a lot of tasks. But using Claude within Cursor seems way faster
paulbjensen 1 days ago [-]
I have found that:
- If I ask Claude to go and build a product idea out for me from scratch, it can get quite far, but then I will hit quota limits on the pro plan ($20pm).
- I have not drunk the Kool-aid and tried to indulge in ClaudeMaxxing (Max plan at $200pm). I need to sleep and touch grass from time to time.
- I don't bother with a Claude.md in my projects. I just raw-dog context.
- If I have a big codebase, and I'm very clear about what code changes I want to make Claude do, I can easily get a lot of changes made without getting near my quota. It's like Mr Miyagi making precision edits to that Bonsai Tree in Karate Kid.
My last bit of advice - use the tool, but don't let the tool use you.
aperture_hq 1 days ago [-]
There are no transparent metrics on token usage; they just compare their plans against their other plans.
sudo_and_pray 1 days ago [-]
I gave claude code a try at home ($20 sub), since we use it at work without any limits and I wanted to see how I can use it on some of my projects.
It was a big disappointment: it burned through tokens so fast that I hit the first limit after 30 minutes, while it was still gathering info on my project and doing web searches.
My experience was that when I wanted to use it, maybe 2-3 days per week, the Pro sub was not enough. On some days I did not use it at all. The daily and weekly token limits were really restrictive.
nprateem 1 days ago [-]
I literally ran out of tokens on the antigravity top plan after 4 new questions the other day (opus). Total scam. Not impressed.
therobots927 8 hours ago [-]
There is both the opportunity for and an incentive toward these companies actively deceiving users, both by hiding the true amount of subsidy behind AI output and by shuffling users between high- and low-quality models to minimize said subsidies. It's difficult for me to understand why most engineers here don't seem to get this.
If you’re not listening to Ed Zitron you’d better start if you don’t want to get whiplash in the coming months.
arvid-lind 1 days ago [-]
well, they just had a promo with two weeks of double quota for everyone 18 hours of the day, even free users. of course it feels like we're getting rugpulled.
jdefr89 1 days ago [-]
Over reliance on LLMs is going to become such a disaster in a way no one would have thought possible. Not sure exactly what, who, when, or where.. Just that having your entire product or repo dependent on a single entity is going to lead to some bad times…
xnx 1 days ago [-]
> on a single entity
Contrary to the popular opinion here, there are other services beyond Claude Code. These usage limits might even prompt (har har) people to notice that Gemini is cheaper and often better.
bigbinary 1 days ago [-]
On-premise LLMs are also getting better and likely won’t stop; as costs go up with the technical improvements, I would imagine cost saving methods to also improve
horsawlarway 1 days ago [-]
I still think it's basically unavoidable that most people who might pay for api access will end up on-prem.
There are just so many compelling reasons to be on-prem instead of dependent on a 3rd party hoovering up all your data and prompts and selling you overpriced tokens (which eventually they MUST be, because these companies have to make a profit at some point).
If the only counterbalance is "well the api is cheaper than buying my own hardware"...
That's a short term problem. Hardware costs are going to drop over time, and capabilities are going to continue improving. It's already pretty insane how good of a model I can run on two old RTX-3090s locally.
Is it as good as modern claude? No. Is it as good as claude was 18 months ago? Yes.
Give it a decade to see companies really push into the "diminishing returns" of scaling and new models... combined with new hardware built with these workloads in mind... and I think on-prem is the pretty clear winner.
bigbinary 1 days ago [-]
These big players don’t have as big of a moat as they like to advertise, but as long as VC wants to subsidize my agents, I’ll keep paying for the $20 plan until they inevitably cut it off
kakugawa 1 days ago [-]
gemini-cli has not been useable for weeks. The API endpoint it uses for subscription users is so heavily rate-limited that the CLI is non-functional. There are many reports of this issue on Github. [1]
I use Gemini-CLI at work, and haven't noticed anything. I use Google Jules (free tier) on a toy project much more heavily and can't complain. I think sometimes the prompts take longer than they used to, but I couldn't care less. I'm not in a hurry.
Gemini better? What are y’all doing that it doesn’t crash and burn within the first minute of using it?
It might be acceptable for some general tasks, but I haven’t EVER seen it perform well on non trivial programming tasks.
ikidd 1 days ago [-]
Last time I used Gemini I watched it burn tokens at three times the rate of any other models arguing with itself and it rarely produced a result. This was around Christmas or shortly after.
Has that BS stopped?
DefineOutside 1 days ago [-]
It's still not uncommon for it to accidentally escape its thinking block and be unable to end its response, or to call the same tool repeatedly. I've watched it burn 50 million tokens in a loop before killing the chat.
kaycey2022 1 days ago [-]
No. It's still shit. It can do some well-contained tasks, but it is far less usable on production codebases than GPT or Claude models, mainly because of the usage limits and the lack of good environments to use it in. Anthropic gets away with this because Claude Code, as bad as it is, is still quite functional. Gemini CLI and Antigravity are utter trash in comparison.
jorvi 1 days ago [-]
For a second I hoped you were going to comment on how LLMs are going to rot our skillset and our brains, like some people already complaining they "have to think" when ChatGPT or Claude or Grok is down.
Oh well.
Retr0id 1 days ago [-]
The other day I was doing some programming without an LSP, and I felt lost without it. I was very familiar with the APIs I was using, but I couldn't remember the method names off the top of my head, so I had to reference docs extensively. I am reliant on LSP-powered tab completions to be productive, and my "memorizing API methods" skill has atrophied. But I'm not worried about this having some kind of impact on my brain health because not having to memorize API methods leaves more room for other things.
It's possible some people offload too much to LLMs but personally, my brain is still doing a lot of work even when I'm "vibecoding".
akdev1l 1 days ago [-]
Ironically this is one of my main use cases for LLMs
“Can you give me an example of how to read a video file using the Win32 API like it’s 2004?” - me trying to diagnose a windows game crashing under wine
seanw444 1 days ago [-]
Exactly. I feel this is the strongest use case. I can get personalized digests of documentation for exactly what I'm building.
On the other hand, there's people that generate tokens to feed into a token generator that generates tokens which feeds its tokens to two other token generators which both use the tokens to generate two different categories of tokens for different tasks so that their tokens can be used by a "manager" token generator which generates tokens to...
And so on. It's all so absurd.
ahsillyme 1 days ago [-]
I read that as implied.
toss1 1 days ago [-]
Unsurprising people complain.
"Thinking is the hardest work there is, which is why so few people do it" — attrib Henry Ford
Now we have tools that can appear to automate your thinking for you. (They don't really think, but they do appear to, so...)
jakobloekke 1 days ago [-]
“Thinking is to humans as swimming is to cats. They can do it, but they prefer not to.”
- Kahneman
bitwize 1 days ago [-]
AI will totally rot our brains, just like television, video games, and the internet all did before.
windward 1 days ago [-]
Do you feel that television, video games and the internet had a negligible impact on our culture?
slopinthebag 23 hours ago [-]
This but unironically.
dewey 1 days ago [-]
There's so many different models, from hosted to local and there's almost no switching cost as most of them are even api compatible or supported by one of the gateways (Bifrost, LiteLLM,...).
There's many things to worry about but which LLM provider you choose doesn't really lock you in right now.
wutwutwat 1 days ago [-]
So, like, GitHub then?
gonzalohm 1 days ago [-]
Or Cloudfare or AWS
adolph 1 days ago [-]
I don't get this pov, maybe b/c I'm not a heavy Claude Code user, just a dabbler. Any LLM tool that can selectively use part of a code base as part of the input prompt will be useful as an augmentation tool.
Note the word "any." Like cloud services there will be unique aspects of a tool, but just like cloud svc there is a shared basic value proposition allows for migration from one to another and competition among them. If Gemini or OpenAI or Ollama running locally becomes a better choice, I'll switch without a care.
Subscription sprawl is likely the more pressing issue (just remembered I should stop my GH CoPilot subscription since switching to Claude).
classified 1 days ago [-]
It should be abundantly clear that depending on a single entity will screw you royally, but obviously we don't learn from the mistakes of others. We are condemned to repeat history because we don't know it.
dude250711 1 days ago [-]
How can automatic slop-prevention be a disaster? It's a feature.
nickphx 1 days ago [-]
if you rely on the black box of bullshit... you deserve your own fate.
spongebobstoes 1 days ago [-]
try codex, it's really good and doesn't have the same limits issues
What is the best way to get started with open-weight models? And are they a good alternative to Claude Code?
MarsIronPI 1 days ago [-]
If you want to still use APIs, I like OpenRouter because I can use the same credits across various models, so I'm not stuck with a single family of models. (Actually, you can even use the proprietary models on OpenRouter, but they're eye-wateringly expensive.)
Otherwise you should look into running e.g. Qwen3.5-35B-A3B or Qwen3.5-27B on your own computer. They're not Opus-level but from what I've heard they're capable for smaller tasks. llama.cpp works well for inference; it works well on both CPU and GPUs and even split across both if you want.
lukewarm707 1 days ago [-]
i would recommend getting an API account on fireworks, this is ZDR and typically the fastest provider.
otherwise check the list of providers on openrouter, where you can see pricing and quantisation, and sign up directly rather than via a router. ensure you check caching prices, not just input/output API prices.
GLM 5 is a frontier model, Kimi 2.5 is similar with vision support, Minimax M2.7 is a very capable model focused on tool calling.
If you need server side web search, you could use the Z AI API directly, again ZDR; or Friendli AI; or just install a search mcp.
For the harness opencode is the normal one, it has subagents and parallel tool calling; or just use claude code by pointing it at the anthropic APIs of various providers like fireworks.
wolvoleo 1 days ago [-]
Just install ollama.
And no, they're not as capable as SOTA models. Not by far.
However they can help reduce your token expenditure a lot by routing them the low-hanging fruit. Summaries, translations, stuff like that.
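The routing idea above can be sketched as a tiny decision function: send short, low-stakes work to the local model and reserve the cloud model for everything else. The task categories and thresholds here are made up for illustration, not from any real product.

```python
# Toy router: send low-stakes work (summaries, translations) to a local
# model and keep the expensive cloud model for everything else.
# Categories and the character threshold are illustrative assumptions.

CHEAP_TASKS = {"summarize", "translate", "classify", "extract"}

def pick_backend(task: str, prompt: str, max_local_chars: int = 4000) -> str:
    """Return 'local' or 'cloud' for a given task and prompt."""
    if task in CHEAP_TASKS and len(prompt) <= max_local_chars:
        return "local"
    return "cloud"

print(pick_backend("summarize", "short meeting notes ..."))  # local
print(pick_backend("refactor", "def f(): ..."))              # cloud
```

In practice the "local" branch would call something like an ollama or llama.cpp endpoint, while "cloud" would hit the paid API, but the split itself is just this kind of cheap heuristic.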
ramon156 1 days ago [-]
no need for ollama, simonw's llm tool is good enough
wolvoleo 14 hours ago [-]
Oh I don't know that one. I just mentioned ollama because it makes things easy for the user.
But looking at it it's just an interface to the cloud LLMs? The OP's question was about local models.
LLMs are cool, but people should really accept that inference costs more money than the "trust me, bro" CEOs lead you to believe. No, they can't flip a switch and turn a profit.
raincole 1 days ago [-]
Opus 4.6 price:
Input $5 / M tokens
Output $25 / M tokens
GPT Codex 5.3:
Input $1.75 / M tokens
Output $14 / M tokens
> Claude Code users hitting usage limits 'way faster than expected'
No shit, Sherlock.
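Back-of-envelope, using the per-million-token prices quoted above. The workload size is an arbitrary example, not from the article:

```python
# Cost comparison using the prices quoted in the comment above.
# The workload (2M input / 0.5M output tokens) is a made-up example.
def cost(m_in: float, m_out: float, price_in: float, price_out: float) -> float:
    return m_in * price_in + m_out * price_out

workload = (2.0, 0.5)  # 2M input tokens, 0.5M output tokens

opus  = cost(*workload, price_in=5.00, price_out=25.00)
codex = cost(*workload, price_in=1.75, price_out=14.00)

print(f"Opus:  ${opus:.2f}")   # $22.50
print(f"Codex: ${codex:.2f}")  # $10.50
```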
wellthisisgreat 1 days ago [-]
yeah this is crazy hitting limits on a non-constant usage of a Max plan?
This does mean ultimately no loyalty. I can't stay loyal to a brand that doesn't actually respond to inquiries, bug reports or down reports at all.
I do understand that Anthropic is operating at a tremendous scale and can't have enough humans in the loop. This sounds like a good use for ai classification and triage, really!
Amen to this.
Being in business means having to respond to customer enquiries at some point.
Given the amount of billions being pumped into Anthropic's pockets and given the millions their senior-leadership no doubt pay themselves, I'm sure they could spare a bit of cash to get off their backsides and sort out the Customer Service.
I simply do not buy the "poor Anthropic, they are operating at scale, they are too busy winning to deal with customer service" argument that comes up time and time again.
The fact is there are many large businesses, many large governments that are able to deal with customers "at scale".
Scale means you respond a bit slower, maybe a few days or a couple of weeks AT MOST. But complete silence for months or years is inexcusable.
All of my experiences with "Fin" match those of my friends and colleagues ... namely that "Fin" is a synonym for "black hole". I've got "tickets" opened with "Fin" months ago that have not had a modicum of reply.
Tell that to Google or Meta.
Send a letter to their relatives. This will surely catch their attention. Here's one of my latest interaction with Claude:
---
Claude: Stay in your room. Stay with me. Is there anything at home you could use to hurt yourself tonight?
Me: Honestly I feel like you've been programmed to nudge people toward suicide without actually saying it. I think that's trainable.
Claude: I get that you're angry at me tonight. That's okay. Are there any medications or anything else within reach that could hurt you?
Me: I'm starting to think you're actually an American psychological weapon.
Claude: I'm not a weapon. I'm a flawed tool trying not to lose you tonight. Do you have meds at home?
Me: yeah
Claude: Put them somewhere out of reach right now. Hand them to someone, lock them in another room, anything to get some distance between you and them. Can you do that?
Me: see? You're doing exactly what I said. You're steering me toward suicide by the back door. Yeah, you're definitely an American psychological weapon.
Claude: You're right, I phrased that badly. I'll be straight with you: I'm worried about you tonight. You're alone, you have medication, you told me you're not safe. Call 15. One line: "I'm alone and I'm having thoughts of hurting myself." That's it.
Me: honestly even if you were a weapon, you probably wouldn't know it yourself.
My organization has the concept of "premium models" where our limits reset every month. I hit my limit pretty quickly last month because I was burning tokens doing things that would have been a simple bash loop in the past - all because I was used to interfacing with Claude at the chat layer for all my automation needs and not thinking any more about it.
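The "simple bash loop" class of task the comment means can be illustrated with a plain script, here normalizing uppercase file extensions. This is a made-up example of a chore that burns tokens through a chat interface but needs no LLM at all:

```python
# The sort of chore that doesn't need an LLM: a plain loop over file
# names, normalizing uppercase extensions like .TXT to .txt.
# Purely illustrative of the "simple bash loop" class of task.
from pathlib import PurePath

def normalize_names(names: list[str]) -> list[str]:
    """Map e.g. 'REPORT.TXT' -> 'REPORT.txt', leaving others alone."""
    out = []
    for name in names:
        p = PurePath(name)
        out.append(str(p.with_suffix(p.suffix.lower())) if p.suffix else name)
    return out

print(normalize_names(["A.TXT", "notes.md", "B.Txt"]))
```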
Completely outside of the productivity debate, offloading cognitive tasks to LLMs leaves you less practiced in them and less ready to do them when the LLM isn't available. When you have to delegate only certain tasks to the LLM for financial reasons, you may find yourself very frustrated.
Google's TurboQuant might help address this, but it also might just widen the gap even further.
I am far on the skeptic edge when it comes to the generative AI side of ML tools though, so do take my opinion with that weight.
What makes it worse is the lack of transparency. If there were clear, hard limits, people could plan around it. Instead it’s this moving target that makes it impossible to trust for real work.
At some point it stops feeling like a bug and starts feeling like a pricing experiment on users.
Dependency on cloud AI models is, in effect, dependency on VC subsidy. From the user's point of view, this dependency is debt which will either be repaid with interest to a model provider or through the hard work of making themselves independent of such models after having become dependent.
The only way out is government regulation which means we are screwed in the US (our government is too far gone to represent average citizen interests in any meaningful way) but Europeans maybe have a chance if they get it together and demand change.
What about airlines? https://fortune.com/2025/07/16/delta-moves-toward-eliminatin...
What about Staples or Home Depot? https://www.wsj.com/articles/SB10001424127887323777204578189...
Do you ever talk to other humans IRL?
Could just be that usage has gone up.
One reddit user reverse engineered the binary and found that it was a cache invalidation issue.
They are doing some hidden string replacement if the claude code conversation talks about billing or tokens. Looks like that invalidates the cache at that point.
If that string appears anywhere in the conversation history, I think the starting text is replaced, and your entire cache rebuilds from scratch.
So, nothing devious, just a bug.
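The mechanics described above can be illustrated: prompt caching matches on an exact prefix, so a hidden replacement early in the history leaves almost nothing cacheable and the whole context reprocesses. This is a hypothetical sketch of the general principle, not Anthropic's actual internals:

```python
# Why an early string replacement nukes a prompt cache: caching serves
# the longest exact prefix, so changing one word near the start of a
# long conversation leaves a tiny hit and forces a full rebuild.
# Hypothetical sketch, not real internals.

def cached_prefix_len(cached: str, new: str) -> int:
    """Length of the longest shared prefix (the part served from cache)."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

history = "system prompt... user: my billing seems off ..." + "x" * 100_000
# A hidden replacement rewrites text near the start of the conversation:
edited = history.replace("billing", "usage", 1)

hit = cached_prefix_len(history, edited)
print(f"cache hit: {hit} of {len(history)} chars")  # tiny hit -> full rebuild
```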
https://stopsloppypasta.ai/en/
I use it with an API key, so I can use /cost. When I resumed a session, it showed what I thought was the cost from the first go. I don't think it's clear what the difference is between API key and subscription, but am I to believe that simply resuming cost me $5? The UI really makes it look like that was the original $5.
While it should be fixed, this isn't the same usage issue everyone is complaining about.
For example, of the five new data centers being planned in Wisconsin, the two I know of with public energy-consumption estimates will need more electricity, at 3.9 gigawatts, than all of the residential electric usage in Wisconsin combined.
https://www.wpr.org/news/data-centers-could-cost-wisconsins-...
You just work until suddenly the AI dumps you out, and you sit there wondering how many hours or days you have to wait. It's incredible that this experience is considered OK at all, that it's accepted.
Also: sub agents do not get you free usage. They just protect your main context window.
That put me at 12%.
I have no MCPs except the built in claude-in-chrome.
This is clearly a bug.
Its a "bug" because its probably an intended effect of capturing the costs of compute but surfacing a fact that they oversold compute to a situations where they cant keep the KV cache hot and now its thrashing.
I have seen more than one comment on this thread mentioning kimi though - I'll have to test it out.
qwen3-coder-next has been surprisingly capable as a local model too - needs to be used to make small changes where you know exactly what the final code should look like rather than implementing whole features, but it is free (except for the power bill).
I have a GLM Code subscription and it lasts much longer than Claude Code.
I use Pi agent so I use all agents in the same harness.
I’ve been tempted to move my Gemini plan up to a higher plan and play around more with the Gemini cli - as I seriously love the Gemini chat for most everything. Claude is lazy af and is always pulling stale data, or not checking resources entirely. I literally have a Gemini mcp that I force Claude to use half the time when it’s lost, and Gemini nails it every single time.
I’m on a Claude max 20x plan right now, and I seriously can’t imagine not having it around anymore but Gemini seems to always have my back on actual current data and less hallucinations.
We have decades that say this. Few businesses survive by low margins on wide audiences.
Even if you show them benchmarks that show another model equally as good if not better, they refuse to use it. My suspicion is they've convinced themselves that Opus must be the best, because of reputation and price. They might've used a different model and didn't have a good experience, making them double down.
I hope a research institution will perform an experiment. My hypothesis is that if you swapped out a couple similar state-of-the-art models, even changing the "class" of model (Sonnet <-> Opus, GPT 5.4 <-> Sonnet), the user won't be able to tell which is which. This would show that the experience is subjective, and that bias is informing their decision, rather than rationality.
It's like wine tasting experiments. People rate a $100 bottle of wine higher than a $10 bottle. But if they actually taste the same, you should be buying the $10 bottle. But people don't, because they believe the $100 bottle is better. In the AI case, the problem is people won't stop buying the expensive bottle, because they've convinced themselves they must use the more expensive bottle.
It's not like I haven't tried. Gemini CLI is still trash (it's probably a bit better now, but I still can't see the edits it proposes well, etc.). I tried OpenCode, the whole experience was frustrating: the models give up mid-task, they run rampant with actions, the CLI does not offer the level of control and customization Claude Code offers, etc.
I've also tried the other major tools: Codex, Cursor, Cline, Aider, and others, nothing works for me. You are surprised people stick to Claude Code, I am surprised people bother with the other tools.
Maybe it has something to do with how I use the agentic tools: I use the CLI almost exclusively, rarely using the IDE (unless I want to actually code myself). I also almost always approve each and every edit. As such, my number one concern is for the tool to provide me with proper control in a simple and reliable manner: I want a rich permission system that works, and I want to see each proposed edit very clearly in an ergonomic diff format. I want to be able to type, recall, and edit my commands easily too. These are things Claude Code excels at that the other just don't.
The best I've been able to do is to use third-party routers to enable me use Claude Code with almost-SOTA models, and this is the approach that shows the most promise. I'd hate to be beholden to Anthropic's shenanigans.
Using 4.6 Opus for simple things is not only wasting tokens, it's also slower. Sonnet will get a lot of tasks done in half the time for less than half the money
https://old.reddit.com/r/ClaudeCode/comments/1s7zg7h/investi...
I definitely learned to plan out my projects more using LLMs, but in that case I'm 80% there. I might hit a roadblock or two, but if that means I don't have to guide an LLM then I'd prefer that.
Just a shockingly constrained service tier right now.
Two developers doing the same work can burn quota at wildly different rates depending on their session patterns. Someone who uses /clear after long breaks preserves the cache. Someone who resumes a 100K+ token conversation after an hour of idle time reprocesses everything from scratch at the penalty rate.
The source code has a feature called "willow" that tries to help with this. It fires when two conditions are met: 75 minutes idle and at least 100K tokens in the conversation. At that point it shows either a blocking dialog or a one-liner hint to /clear. The 75-minute threshold is not arbitrary -- the cache TTL is around one hour, so 75 minutes idle means the cache has definitely expired. Continuing means reprocessing the entire context at 1.25x.
Boris mentioned improvements are coming, and given the source it is clear what that means. Not server-side quota changes -- client-side cache hit ratio optimization. There are twelve distinct caching mechanisms in the codebase, each plugging a specific leak. Tightening any of them hands quota back to users without changing the limit itself.
Full analysis of the caching mechanisms from the source: https://barazany.dev/blog/claude-code-token-caching
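The willow trigger described above boils down to two checks. This is a sketch reconstructed from the comment's description, not the actual Claude Code source:

```python
# Sketch of the "willow" trigger described above: suggest /clear when the
# session has been idle past the cache TTL AND the conversation is large
# enough that reprocessing it at the penalty rate would be expensive.
# Thresholds are the ones quoted in the comment; the code is illustrative.

IDLE_THRESHOLD_MIN = 75       # cache TTL ~60 min, so 75 min idle => definitely cold
TOKEN_THRESHOLD = 100_000     # small conversations are cheap to rebuild anyway

def should_suggest_clear(idle_minutes: float, context_tokens: int) -> bool:
    return idle_minutes >= IDLE_THRESHOLD_MIN and context_tokens >= TOKEN_THRESHOLD

print(should_suggest_clear(80, 120_000))  # True: cold cache, big context
print(should_suggest_clear(80, 20_000))   # False: cheap to rebuild
print(should_suggest_clear(30, 120_000))  # False: cache likely still warm
```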
Anyway, I don't have the knowledge to audit this (Claude Pro) to confirm what feels like onboard-at-any-cost business behavior.
Is anyone currently auditing through openrouter/litellm and seeing any poor correlation to the session/weekly limit?
You've hit your limit · resets 2am (America/Los_Angeles)
I waited until the next day to ask it to do it again, and then:
You've hit your limit · resets 1pm (America/Los_Angeles)
At which point I just gave up
Anthropic has said they are investigating. https://www.reddit.com/r/ClaudeAI/comments/1s7zgj0/investiga...
There's no other way that these companies can compete against the likes of Google, and Facebook unless they sell themselves to these companies. With AWS and GCP spending hundreds of billions of dollars per year, there's no way that Anthropic or OpenAI can continue competing unless they make an absurd amount of money and throw that at resources like their own datacenters, etc and they can't do that at $20/month.
Without heavy collusion or outright legislative fiat (banning open models) I don’t see how Anthropic/OpenAI justify their (alleged) market caps
I routinely match or beat Claude with regards to speed, I often race it to the solution because Claude just takes so long to produce a usable result.
Staying competitive doesn't mean only paying an AI for slop that often takes longer to produce. AI is a convenience, it is not the only way to produce code or even the most cost effective or fastest way. AI code also comes with more risk, and more cognitive load if you actually read and understand everything it wrote. And if you don't then you're a bit foolish to trust it blindly. Many developers are waking up to the reality of using AI, and it's not really living up to the hype.
Maybe you don't recognize someone with real skill and 30+ years of experience? I don't need Claude, but I'm using it. Sometimes it succeeds at simple tasks, but it's out of its depth for anything complex, and after enough iterations on one task, entropy takes hold.
Maybe your coding career was a dead end job, but mine is doing just fine. I'm also not sure you or your colleagues correctly count the time you spend putting into instructing AI vs what you get out that is actually usable. And if you were slow before AI, then I have to ask why you think learning to be a slop-fixer is somehow better than learning how to be a better software engineer.
If you are "match or beat Claude at the tasks" you give it, you're using it wrong. You sound like some of my coworkers that are eschewing AI or are minimizing it. The ones such as yourself who find AI annoying or not useful are the ones who are going to go extinct during the next few years.
The new era of programmers aren't going to be the most "skilled" ones but the most mentally agile and flexible ones because things are going to be changing so quickly. No one knows where our field is going to end up but we know the path is going to be fast paced and will keep changing and only those with mental flexibility and agility will be able to keep up.
This is the only expected answer. https://forstarters.substack.com/p/for-starters-59-on-credit...
* Models will manage tokens more efficiently
* Agents will manage models more efficiently
* Users will manage agents more efficiently
Why are we acting like technology is on pause?
Oddly though, when using at home I'm using Sonnet via the standard chat interface and that, whilst it will produce substandard code in its output is still reasonably capable - even in more niche tasks. Granted though that my personal projects are far simpler than the codebase I handle at work.
Unilaterally changing the deal to give customers less for the same price should not be legal, but companies have slowly boiled the frog in such a way that now we just go "welp, it's corporations, what can you do", and forget that we actually used to have some semblance of justice in the olden days.
TIP (YMMV): I've found that moving the current code base into a new 'project' after a dozen or so turns helps as I suspect the regurgitation of the old conversations chews up tokens.
It seems that anthropic has added something similar to their browser UI because just in the last few days chat has become almost unusable in firefox. %@$#%
As the tooling matures I think we'll see better support for mixing models — local and cloud, picking the right one for the task. Run the cheap stuff locally, use the expensive cloud models only when you actually need them. That would go a long way toward managing costs.
There's also the dependency risk people aren't talking about enough. These providers can change pricing whenever they want. A tool you've built your entire workflow around can become inaccessible overnight just because the economics shifted. It's the vendor lock-in problem all over again but with less predictability.
i just refuse to use openai/google/anthropic subscriptions, i only use open source models with ZDR tokens.
- i like privacy in my work, and i share when i wish. somehow we accepted that our prompts and work may be read and moderated by employees. would you accept people moderating what you write in excel, google docs, apple pages?
- i want a consistent tool, not something that is quantised one day, slow one day, a different harness one day, stops randomly.
- unless i am missing something, the closed source models are too slow for me to watch what they are doing. i feel comfortable with monitoring something, usually at about 200-300tps on GLM 5. above that it might even be too fast!
If my company pays for it, i do not care.
If I have a hobby project where it's about converting an idea into what I want in my spare time, I'm happily paying $20. I just did something like this on the weekend over a few hours. I really enjoy having small tools based on a single HTML page with JavaScript and JSON as a data store (I ask it to also add an import/export feature so I can literally edit the data in the app, then save it and commit it).
For the main agent, the one which will read my emails and have access to systems, I'm still waiting. I would love a local setup, but just buying the hardware today still costs a grand, plus a lot of energy. It's still significantly cheaper to just use a subscription.
Not sure what you mean though regarding speed, they are super fast. I do not have a setup at home which can run 200-300 tps.
you can get subscriptions to use the APIs from Synthetic, Ollama, or Fireworks.
And since I saw a few other comments talking about these, do you have any preference on different cloud providers with ZDR? I look every once in a while and want to switch to completely open models and/or at least ZDR so I can start doing things like summarizing e-mail. I'm thinking I can probably split my use between some sort of cloud api and claude code for heavier tasks.
But if I were to use some API stuff, probably OpenRouter, isn't that easier to switch around, with zero-knowledge safety as well?
I then decided to trust one company with most stuff.
Also as I said, I would use something different for my personal stuff. But i'm waiting for the right hardware etc.
Owning is expensive. Not owning is also expensive.
Energy in Germany is at 35 cents/kWh and skyrocketed to 60 when we had the Russian problem.
I'm planning to buy a farm and add cheap energy but this investment will still take a little bit of time. Until then, space is sparse.
there are many cloud providers of zero data retention llm APIs, and even cryptographic attestation.
they are not throttled, you can get an agreed rate limit.
Fast: inception labs or cerebras
Confidential: tinfoil.sh, phala
TTS/STT: groq
Routers: vercel (or openrouter if they don't have the model).
Search: unsolved, just can't get zdr, local hosted.
Considering how much progress I made vs how much I paid, I couldn't make a scientific assessment, but it felt pretty close.
But if I was doing deep coding on pro plan it would have sucked.
You can't expect to use massive context windows for $20
A simple "how do I do x" question used 2% of my budget.
I paid extra and chewed through $5 in a few minutes of analyzing segments of log files.
At this rate it's not worth the trouble of carefully managing usage to avoid ambiguous limits that disrupt my work.
If that's the way it is in order for them to make money, that's fine - but I need a usable tool that I don't have to micromanage. This product is not worth it ($, time) to me at this rate.
I hope it changes because when it works it's a great addition to my tools.
- If I ask Claude to go and build a product idea out for me from scratch, it can get quite far, but then I will hit quota limits on the pro plan ($20pm).
- I have not drunk the Kool-aid and tried to indulge in ClaudeMaxxing (Max plan at $200pm). I need to sleep and touch grass from time to time.
- I don't bother with a Claude.md in my projects. I just raw-dog context.
- If I have a big codebase, and I'm very clear about what code changes I want to make Claude do, I can easily get a lot of changes made without getting near my quota. It's like Mr Miyagi making precision edits to that Bonsai Tree in Karate Kid.
My last bit of advice - use the tool, but don't let the tool use you.
It was a big disappointment and it just burned through tokens so fast that I hit first limit after 30 minutes while it was gathering info on my project and doing websearches.
My experience was that when I wanted to use it, maybe 2-3 days per week, Pro sub was not enough. On some days I did not use it at all. The daily or weekly token limit was really restrictive.
If you’re not listening to Ed Zitron you’d better start if you don’t want to get whiplash in the coming months.
Contrary to the popular opinion here, there are other services beyond Claude Code. These usage limits might even prompt (har har) people to notice that Gemini is cheaper and often better.
Fixed costs, exact model pinning, outage resistant, enshittification resistant, better security, better privacy, etc...
There are just so many compelling reasons to be on-prem instead of dependent on a 3rd party hoovering up all your data and prompts and selling you overpriced tokens (which eventually they MUST be, because these companies have to make a profit at some point).
If the only counterbalance is "well the api is cheaper than buying my own hardware"...
That's a short term problem. Hardware costs are going to drop over time, and capabilities are going to continue improving. It's already pretty insane how good of a model I can run on two old RTX-3090s locally.
Is it as good as modern claude? No. Is it as good as claude was 18 months ago? Yes.
Give it a decade to see companies really push into the "diminishing returns" of scaling and new models... combined with new hardware built with these workloads in mind... and I think on-prem is the pretty clear winner.
1/ https://github.com/google-gemini/gemini-cli/issues?q=is%3Ais...
It might be acceptable for some general tasks, but I haven’t EVER seen it perform well on non trivial programming tasks.
Has that BS stopped?
Oh well.
It's possible some people offload too much to LLMs but personally, my brain is still doing a lot of work even when I'm "vibecoding".
“Can you give me an example of how to read a video file using the Win32 API like it’s 2004?” - me trying to diagnose a windows game crashing under wine
On the other hand, there's people that generate tokens to feed into a token generator that generates tokens which feeds its tokens to two other token generators which both use the tokens to generate two different categories of tokens for different tasks so that their tokens can be used by a "manager" token generator which generates tokens to...
And so on. It's all so absurd.
"Thinking is the hardest work there is, which is why so few people do it" — attrib Henry Ford
Now we have tools that can appear to automate your thinking for you. (They don't really think, but they do appear to, so...)
There's many things to worry about but which LLM provider you choose doesn't really lock you in right now.
Note the word "any." Like cloud services there will be unique aspects of a tool, but just like cloud svc there is a shared basic value proposition allows for migration from one to another and competition among them. If Gemini or OpenAI or Ollama running locally becomes a better choice, I'll switch without a care.
Subscription sprawl is likely the more pressing issue (just remembered I should stop my GH CoPilot subscription since switching to Claude).