MCP server that reduces Claude Code context consumption by 98% (mksg.lu)
blakec 18 hours ago [-]
The FTS5 index approach here is right, but I'd push further: pure BM25 underperforms on tool outputs because they're a mix of structured data (JSON, tables, config) and natural language (comments, error messages, docstrings). Keyword matching falls apart on the structured half.

I built a hybrid retriever for a similar problem, compressing a 15,800-file Obsidian vault into a searchable index for Claude Code. Stack is Model2Vec (potion-base-8M, 256-dimensional embeddings) + sqlite-vec for vector search + FTS5 for BM25, combined via Reciprocal Rank Fusion. The database is 49,746 chunks in 83MB. RRF is the important piece: it merges ranked lists from both retrieval methods without needing score calibration, so you get BM25's exact-match precision on identifiers and function names plus vector search's semantic matching on descriptions and error context.
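The RRF step described above can be sketched in a few lines. This is a generic illustration of Reciprocal Rank Fusion, not the commenter's actual code; the chunk IDs are made up, and k=60 is the conventional constant from the original RRF paper, which damps the weight of top-ranked items.

```python
# Reciprocal Rank Fusion: merge ranked lists from BM25 and vector search.
# Only rank positions matter, so no score calibration between retrievers
# is needed -- exactly the property the comment above relies on.
def rrf_merge(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["chunk_42", "chunk_7", "chunk_99"]    # exact-match precision
vector_hits = ["chunk_7", "chunk_13", "chunk_42"]  # semantic matches
print(rrf_merge([bm25_hits, vector_hits]))
# -> ['chunk_7', 'chunk_42', 'chunk_13', 'chunk_99']
```

A chunk that appears in both lists (chunk_7, chunk_42) outranks one that appears high in only one, which is the fusion behavior that makes hybrid retrieval work.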

The incremental indexing matters too. If you're indexing tool outputs per-session, the corpus grows fast. My indexer has a --incremental flag that hashes content and only re-embeds changed chunks. Full reindex of 15,800 files takes ~4 minutes; incremental on a typical day's changes is under 10 seconds.
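A minimal sketch of the content-hashing idea behind an `--incremental` flag, assuming a dict of chunk texts and a stored-hash table; both names are stand-ins, not the real indexer's structures:

```python
import hashlib

def changed_chunks(chunks, stored_hashes):
    """Yield (chunk_id, text) only for chunks whose content hash changed,
    updating the stored hash as a side effect. Unchanged chunks are
    skipped, so they never get re-embedded."""
    for chunk_id, text in chunks.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(chunk_id) != digest:
            stored_hashes[chunk_id] = digest
            yield chunk_id, text

stored = {}
chunks = {"note.md#0": "hello", "note.md#1": "world"}
first = dict(changed_chunks(chunks, stored))   # first run: everything is new
second = dict(changed_chunks(chunks, stored))  # second run: nothing changed
print(len(first), len(second))  # -> 2 0
```

This is why the incremental pass finishes in seconds: the expensive embedding step only runs on the handful of chunks whose hash differs.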

On the caching question raised upthread: this approach actually helps prompt caching because the compressed output is deterministic for the same query. The raw tool output would be different every time (timestamps, ordering), but the retrieved summary is stable if the underlying data hasn't changed.

One thing I'd add to Context Mode's architecture: the same retriever could run as a PostToolUse hook, compressing outputs before they enter the conversation. That way it's transparent to the agent, it never sees the raw dump, just the relevant subset.

thecopy 10 hours ago [-]
Very interesting. One big wrinkle with OP's approach is exactly that: structured responses, which many tools return, are untouched. The solution in OP, as I understand it, is the "execute" method. However, I'm building an MCP gateway, and such sandboxed execution isn't available (...yet), so your approach to this sounds very clever. I'll spend the day trying it out.
doctorpangloss 4 hours ago [-]
The LLM that wrote the comment you are replying to has no idea what it is talking about...
danw1979 14 hours ago [-]
Would love to read a more in depth write up of this if you have the time !

I suspect the obsessive note-taker crowd on HN would appreciate it too.

tclancy 10 hours ago [-]
Seconded that I would love to see the what, why and how of your Obsidian work.
mksglu 2 days ago [-]
Author here. I shared the GitHub repo a few days ago (https://news.ycombinator.com/item?id=47148025) and got great feedback. This is the writeup explaining the architecture.

The core idea: every MCP tool call dumps raw data into your 200K context window. Context Mode spawns isolated subprocesses — only stdout enters context. No LLM calls, purely algorithmic: SQLite FTS5 with BM25 ranking and Porter stemming.
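The FTS5 + BM25 + Porter-stemming combination is easy to demonstrate with Python's bundled sqlite3, assuming a build with FTS5 enabled; the table and column names here are illustrative, not the ones Context Mode actually uses:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Porter tokenizer so 'fixing' in a query matches 'fixed' in a document.
db.execute(
    "CREATE VIRTUAL TABLE outputs USING fts5(tool, content, tokenize='porter')"
)
db.executemany(
    "INSERT INTO outputs VALUES (?, ?)",
    [
        ("git", "fixed failing tests in the parser module"),
        ("git", "refactored configuration loading"),
    ],
)
# FTS5's built-in 'rank' column orders results by BM25 (best first).
rows = db.execute(
    "SELECT content FROM outputs WHERE outputs MATCH ? ORDER BY rank",
    ("fixing tests",),
).fetchall()
print(rows[0][0])  # -> 'fixed failing tests in the parser module'
```

No LLM in the loop: the sandboxed subprocess indexes raw output like this, and only a short summary plus the ability to run such queries enters the conversation.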

Since the last post we've seen 228 stars and some real-world usage data. The biggest surprise was how much subagent routing matters — auto-upgrading Bash subagents to general-purpose so they can use batch_execute instead of flooding context with raw output.

Source: https://github.com/mksglu/claude-context-mode Happy to answer any architecture questions.

lkbm 20 hours ago [-]
Small suggestion: link to the Cloudflare Code Mode post[0] in the blog post where you mention it. It's linked in the README, but when I saw it in the blog post I had to Google it.

[0] https://blog.cloudflare.com/code-mode-mcp/

re5i5tor 1 days ago [-]
Really intrigued and def will try, thanks for this.

In connecting the dots (and help me make sure I'm connecting them correctly), context-mode _does not address MCP context usage at all_, correct? You are instead suggesting we refactor or eliminate MCP tools, or apply concepts similar to context_mode in our MCPs where possible?

Context-mode is still very high value, even if the answer is "no," just want to make sure I understand. Also interested in your thoughts about the above.

I write a number of MCPs that work across all Claude surfaces; so the usual "CLI!" isn't as viable an answer (though with code execution it sometimes can be) ...

Edit: typo

mksglu 1 days ago [-]
Right, context-mode doesn't change how MCP tool definitions get loaded into context. That's the "input side" problem that Cloudflare's Code Mode tackles by compressing tool schemas. Context-mode handles the "output side," the data that comes back from tool calls. That said, if you're writing your own MCPs, you could apply the same pattern directly. Instead of returning raw payloads, have your MCP server return a compact summary and store the full output somewhere queryable. Context-mode just generalizes that so you don't have to rebuild it per server.
re5i5tor 23 hours ago [-]
Hmmm. I was talking about the output side. When data comes back from an MCP tool call, context-mode is still not in the loop, not able to help, is it?

Edit: clarify "MCP tool"

re5i5tor 22 hours ago [-]
I dug into this further. Tested empirically and read the code.

Confirmed: context-mode cannot intercept MCP tool responses. The PreToolUse hook (hooks/pretooluse.sh) matches only Bash|Read|Grep|Glob|WebFetch|WebSearch|Task. When I called my obsidian MCP's obsidian_list via MCP, the response went straight into context — zero entries in context-mode's FTS5 database. The web fetches from the same session were all indexed.

The context-mode skill (SKILL.md) actually acknowledges this at lines 71-77 with an "after-the-fact" decision tree for MCP output: if it's already in context, use it directly; if you need to search it again, save to file then index. But that's damage control — the context is already consumed. You can't un-eat those tokens.

The architectural reason: MCP tool responses flow via JSON-RPC directly to the model. There's no PostToolUse hook in Claude Code that could modify or compress a response before it enters context. And you can't call MCP tools from inside a subprocess, so the "run it in a sandbox" pattern doesn't apply.

So the 98% savings are real but scoped to built-in tools and CLI wrappers (curl, gh, kubectl, etc.) — anything replicable in a subprocess. For third-party MCP tools with unique capabilities (Excalidraw rendering, calendar APIs, Obsidian vault access), the MCP author has to apply context-mode's concepts server-side: return compact summaries, store full output queryably, expose drill-down tools. Which is essentially what you suggested above.

Still very high value for the built-in tool side. Just want the boundary to be clear.

Correct any misconceptions please!

1 days ago [-]
nextaccountic 19 hours ago [-]
Can this be used with other agents? I'm looking specifically into the Zed Agent
nitinreddy88 22 hours ago [-]
Any reason why it doesn't support Codex? I believe the idea and implementation seems to be pretty much agent independent
esafak 1 days ago [-]
Does your technique break the cache? edit: Thanks.
doctorpangloss 4 hours ago [-]
The LLM that the "author" is using has no idea what it's talking about, and the reply you got is nonsense.

@dang it's really bad lately.

mksglu 1 days ago [-]
Nope. The raw data never enters the conversation history in the first place, so there's nothing to invalidate. Tool output runs in a sandbox, a short summary comes back, and the full data sits in a local FTS5 index. The conversation cache stays intact because the context itself doesn't change after the fact.
nr378 1 days ago [-]
Nice work.

It strikes me there's more low-hanging fruit to pluck re: context window management. Backtracking seems like another promising direction for avoiding context bloat and compaction (i.e. when a model takes a few attempts to do the right thing, once it's done the right thing, prune the failed attempts out of the context).

elephanlemon 1 days ago [-]
Agree. I’d like more fine grained control of context and compaction. If you spend time debugging in the middle of a session, once you’ve fixed the bugs you ought to be able to remove everything related to fixing them out of context and continue as you had before you encountered them. (Right now depending on your IDE this can be quite annoying to do manually. And I’m not aware of any that allow you to snip it out if you’ve worked with the agent on other tasks afterwards.)

I think agents should manage their own context too. For example, if you’re working with a tool that dumps a lot of logged information into context, those logs should get pruned out after one or two more prompts.

Context should be thought of something that can be freely manipulated, rather than a stack that can only have things appended or removed from the end.

nr378 1 days ago [-]
Oh that's quite a nice idea - agentic context management (riffing on agentic memory management).

There's some challenges around the LLM having enough output tokens to easily specify what it wants its next input tokens to be, but "snips" should be able to be expressed concisely (i.e. the next input should include everything sent previously except the chunk that starts XXX and ends YYY). The upside is tighter context, the downside is it'll bust the prompt cache (perhaps the optimal trade-off is to batch the snips).
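The "snip" idea can be expressed very compactly. A toy sketch, assuming the context is a flat string and the snip is given as a (start, end) marker pair; the markers here are hypothetical:

```python
def apply_snip(context, start_marker, end_marker):
    """Rebuild the next input as everything sent previously, minus the
    chunk that starts with start_marker and ends with end_marker."""
    start = context.index(start_marker)
    end = context.index(end_marker, start) + len(end_marker)
    return context[:start] + context[end:]

history = "user: fix bug\n<FAIL>three broken attempts...</FAIL>\nfinal working diff"
pruned = apply_snip(history, "<FAIL>", "</FAIL>")
print(pruned)  # -> 'user: fix bug\n\nfinal working diff'
```

The marker pair is only a few output tokens, which is the concise-specification property the comment points at; the cache-busting tradeoff remains, since the rebuilt prefix no longer matches what was cached.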

mksglu 1 days ago [-]
Good point on prompt cache invalidation. Context-mode sidesteps this by never letting the bloat in to begin with, rather than snipping it out after. Tool output runs in a sandbox, a short summary enters context, and the raw data sits in a local search index. No cache busting because the big payload never hits the conversation history in the first place.
lowbloodsugar 16 hours ago [-]
So I built that in my chat harness. I just gave the agent a “prune” tool and it can remove shit it doesn’t need any more from its own context. But chat is last gen.
FuckButtons 1 days ago [-]
Yeah, the fact that we have treated context as immutable baffles me. It's not like human working memory keeps a perfect history of everything done over the last hour. It shouldn't be that complicated to train a secondary model that just runs online compaction, e.g.: it runs a tool call, the model determines what's germane to the conversation and prunes the rest; or some task gets completed, so just leave a stub in the context that says "completed x", with a tool available to see the details of x if it becomes relevant again.
mksglu 1 days ago [-]
That's pretty much the approach we took with context-mode. Tool outputs get processed in a sandbox, only a stub summary comes back into context, and the full details stay in a searchable FTS5 index the model can query on demand. Not trained into the model itself, but gets you most of the way there as a plugin today.
FuckButtons 19 hours ago [-]
This is a partial realization of the idea, but, for a long running agent the proportion of noise increases linearly with the session length, unless you take an appropriately large machete to the problem you’re still going to wind up with sub optimal results.
jerf 18 hours ago [-]
Yeah, I'd definitely like to be able to edit my context a lot more. And once you consider that, you start seeing things in your head like "select this big chunk of context and ask the model to simplify that part", or fixing the model trying to ingest too many tokens because it dumped in a whole file it didn't realize was going to be as large as it was. There's about a half-dozen things like that that are immediately, obviously useful.
esperent 1 days ago [-]
Is it because of caching? If the context changes arbitrarily every turn then you would have to throw away the cache.
FuckButtons 20 hours ago [-]
So use a block based cache and tune the block size to maximize the hit rate? This isn’t rocket science.
wonnage 17 hours ago [-]
This seems misguided, you have to cache a prefix due to attention.
21 hours ago [-]
dsclough 5 hours ago [-]
Trees in pi let you do this, after done debugging you move back up and continue, leaving all the debugging context in its own branch
MichaelDickens 20 hours ago [-]
> I think agents should manage their own context too.

My intuition is that this should be almost trivial. If I copy/paste your long coding session into an LLM and ask it which parts can be removed from context without losing much, I'm confident that it will know to remove the debugging bits.

bbatha 18 hours ago [-]
I generally do this when I arrive at the agent getting stuck at a test loop or whatever after injecting some later requirement in and tweaking. Once I hit a decent place I have the agent summarize, discard the branch (it’s part of the context too!) and start with the new prompt
esperent 1 days ago [-]
> For example, if you’re working with a tool that dumps a lot of logged information into context

I've set up a hook that blocks directly running certain common tools and instead tells Claude to pipe the output to a temporary file and search that for relevant info. There's still some noise where it tries to run the tool once, gets blocked, then runs it the right way. But it's better than before.

wonnage 17 hours ago [-]
I think telling it to run those in a subagent should accomplish the same thing and ensure only the answer makes it to the main context. Otherwise you will still have some bloat from reading the exact output, although in some cases that could be good if you’re debugging or something
esperent 11 hours ago [-]
Not really because it reliably greps or searches the file for relevant info. So far I haven't seen it ever load the whole file. It might be more efficient for the main thread to have a subagent do it but probably at a significant slowdown penalty when all I'm doing is linting or running tests. So this is probably a judgement call depending on the situation.
mullingitover 20 hours ago [-]
I’ve been wondering about this and just found this paper[1]: Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Looks interesting.

[1] https://arxiv.org/html/2510.04618v1

mksglu 1 days ago [-]
That's exactly what context-mode does for tool outputs. Instead of dumping raw logs and snapshots into context, it runs them in a sandbox and only returns a summary. The full data stays in a local FTS5 index so you can search it later when you need specifics.
8note 23 hours ago [-]
what i want is for the agent to initially get the full data and make the right decision based on it; then later it doesn't need to know as much about how it got there.

isn't that how thinking works? intermediate tokens that then get replaced with the result?

8note 23 hours ago [-]
i think something kinda easy for that could be to pretend that pruned output was actually done by a subagent. copy the detailed logs out, and replace it with a compacted summary.
jaredsohn 22 hours ago [-]
Treat context like git shas. Yes, there is a specific order within a 'branch' but you should be able to do the equivalent of cherry-picking and rebasing it
snowhale 10 hours ago [-]
[dead]
mksglu 1 days ago [-]
Totally agree. Failed attempts are just noise once the right path is found. Auto-detecting retry patterns and pruning them down to the final working version feels very doable, especially for clear cases like lint or compilation fixes.
jonnycoder 1 days ago [-]
It feels like the late 1990s all over again, but instead of html and sql, it’s coding agents. This time around, a lot of us are well experienced at software engineering and so we can find optimizations simply by using claude code all day long. We get an idea, we work with ai to help create a detailed design and then let it develop it for us.
mksglu 1 days ago [-]
The people who spent years doing the work manually are the ones who immediately see where the bottlenecks are.
ip26 1 days ago [-]
Maybe the right answer is “why not both”, but subagents can also be used for that problem. That is, when something isn’t going as expected, fork a subagent to solve the problem and return with the answer.

It’s interesting to imagine a single model deciding to wipe its own memory though, and roll back in time to a past version of itself (only, with the answer to a vexing problem)

jon-wood 1 days ago [-]
I forget where now but I'm sure I read an article from one of the coding harness companies talking about how they'd done just that. Effectively it could pass a note to its past self saying "Path X doesn't work", and otherwise reset the context to any previous point.

I could see this working like some sort of undo tree, with multiple branches you can jump back and forth between.

IncreasePosts 22 hours ago [-]
I do this with my agents. Basically, every "work"-oriented call spawns a subprocess which does not add anything to the parent context window. When the subprocess completes the task, I ask it to 1) provide a complete answer, 2) provide a succinct explanation of how the answer was arrived at, 3) provide a succinct explanation of any attempts which did not work, and 4) anything learned during the process which may be useful in the future. Then, I feed those four answers back to the parent as if they were magically arrived at.

Another thing I do for managing the context window: any tool/MCP call has its output piped into a file. The LLM can then read parts of the file and only add them to its context if they're sufficient. For example, if a command produces a lot of output and ultimately ends in "Success!", the LLM can just tail the last line to see if it succeeded. If it did, the rest of the output doesn't need to be read; if it failed, the failure message is usually at the end of the log.

Something I'm working on now is having a smaller local model summarize the log output and feed that summarization to the more powerful LLM (because I can run my local model for ~free, but it is nowhere near as capable as the cloud models). I don't keep up with SOTA so I have no idea if what I'm doing is well known or not, but it works for me and my setup.
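The pipe-to-file-and-tail pattern above is a few lines in any language. A hedged sketch, with the file name and success marker purely illustrative:

```python
import os
import subprocess
import tempfile

def run_to_file(cmd):
    """Run a command with all output captured to a temp file; only the
    path (not the output) would be handed back to the agent."""
    fd, path = tempfile.mkstemp(suffix=".log")
    with os.fdopen(fd, "w") as log:
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=False)
    return path

def tail(path, n=1):
    """Read only the last n lines -- the part that enters context."""
    with open(path) as f:
        return f.readlines()[-n:]

log_path = run_to_file(["sh", "-c", "echo step1; echo step2; echo Success!"])
print(tail(log_path))  # only the final line is read into context
```

If the last line is `Success!`, the hundreds of preceding lines never cost a token; on failure, the agent can read a larger tail or grep the file for the error.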
vexorkai 11 hours ago [-]
This post made me realize I had zero visibility into where my Claude Code tokens were actually going, so I built a small companion CLI this morning: https://github.com/vexorkai/claude-trace

It parses ~/.claude/projects/*/*.jsonl and breaks usage down by session, tool, project, and timeline with cost estimates (including cache read/create split).

Context Mode solves output compression really well; this is more of a measurement layer so you can see where the burn is before/after changes.

Disclosure: I built it.

ericpauley 10 hours ago [-]
> I had zero visibility into where my Claude Code tokens were actually going

/context?

hereme888 16 hours ago [-]
The hooks seem too aggressive. Blocking all curl/wget/WebFetch and funneling everything through the sandbox for 56 KB snapshots sounds great, but not for curl api.example.com/health returning 200 bytes.

Compressing 153 git commits to 107 bytes means the LLM has to write the perfect extraction script before it can see the data. So if it writes a `git log --oneline | wc -l` when you needed specific commit messages, that information is gone.

The benchmarks assume the model always writes the right summarization code, which in practice it doesn't.

sagarpatil 14 hours ago [-]
Agreed. I removed it.
specialp 1 days ago [-]
Do you need 80+ tools in context? Even if reduced, why not use sub agents for areas of focus? Context is gold and the more you put into it unrelated to the problem at hand the worse your outcome is. Even if you don't hit the limit of the window. Would be like compressing data to read into a string limit rather than just chunking the data
mksglu 1 days ago [-]
That's a fair point and honestly the ideal approach. But in practice most people don't hand-curate their MCP server list per task. They install 5-6 servers and suddenly have 80 tools loaded by default. Context-mode doesn't solve the tool definition bloat, that's the input side problem. It handles the output side, when those tools actually run and dump data back. Even with a focused set of tools, a single Playwright snapshot or git log can burn 50k tokens. That's what gets sandboxed.
muddi900 11 hours ago [-]
A lot of this token usage can be avoided by using CLI apps instead of MCPs. For example, the github cli is very robust and does the job of the MCP with a fraction of the token cost
bear3r 6 hours ago [-]
yeah gh cli in particular is lean. though `gh pr view --json body,comments` can still flood context fast. the real win here is gatekeeping what hits context at all, regardless of source.
startages 14 hours ago [-]
Not bad, but it sacrifices accuracy, and there are risks of causing more hallucinations from incomplete data or the agent writing bad extraction logic. The whole MCP assumes Claude is smart enough to write good extraction scripts AND formulate good search queries. I'm sure this could expand into something better in the future, but information preservation is a real issue in my experience.
buremba 1 days ago [-]
AFAIK Claude Code doesn't inject all the MCP output into the context. It caps output at 25k tokens and uses bash pipe operators to read the full output. That's at least what I see in the latest version.
mksglu 1 days ago [-]
That's true, Claude Code does truncate large outputs now. But 25k tokens is still a lot, especially when you're running multiple tools back to back. Three or four Playwright snapshots or a batch of GitHub issues and you've burned 100k tokens on raw data you only needed a few lines from. Context-mode typically brings that down to 1-2k per call while keeping the full output searchable if you need it later.
nharada 3 hours ago [-]
Nice, I like the idea. It sounds like qualitatively you haven't had any performance regressions while doing this, but have you tested it at all on any sort of benchmark or similar eval? I'm curious how well the actual system performs with less context like this. I mean it's possible it actually improves...
andai 1 days ago [-]
This article's specific brand of AI writing reminded me of Kevin's Small Talk

https://www.youtube.com/watch?v=bctjSvn-OC8

giancarlostoro 1 days ago [-]
This sounds a little bit like rkt? Which trims output from other CLI applications like git, find and the most common tools used by Claude. This looks like it goes a little further which is interesting.

I see some of these AI companies adopting some of these ideas sooner or later. Trim the tokens locally to save on token usage.

https://github.com/rtk-ai/rtk

mksglu 1 days ago [-]
Haven't looked at rtk closely but from the description it sounds like it works at the CLI output level, trimming stdout before it reaches the model. Context-mode goes a bit further since it also indexes the full output into a searchable FTS5 database, so the model can query specific parts later instead of just losing them. It's less about trimming and more about replacing a raw dump with a summary plus on-demand retrieval.
giancarlostoro 1 days ago [-]
Yeah I like this approach too. I made a tool similar to Beads and after learning about RTK I updated mine to produce less token hungry output. I'm still working on it.

https://github.com/Giancarlos/guardrails

esperent 1 days ago [-]
Does context mode only work with MCPs? Or does it work with bash/git/npm commands as well?
re5i5tor 22 hours ago [-]
I'm not sure it actually works with MCPs *at all*, trying to get that clarified. How can context-mode get "into the MCP loop"?
re5i5tor 22 hours ago [-]
See my comment above, context-mode has no way to inject itself into the MCP tool-call - response loop.

Still high-value, outside MCPs.

RyanShook 1 days ago [-]
I’m also trying to see which one makes more sense. Discussion about rtk started today: https://news.ycombinator.com/item?id=47189599
ChicagoDave 5 hours ago [-]
Pretty sure you’re losing vital information from cross-session contex. You may be able to work longer in a single session, but Claude will degrade without even the mundane details. It’s like doing math without showing your work.
sarkarsh 10 hours ago [-]
The compression numbers look great but I keep wondering: does the model actually produce equivalent output with compressed context vs full context? Extending sessions from 30min to 3hrs only matters if reasoning quality holds up in hour 2.

esafak's cache economics point is underrated. With prompt caching, verbose context that gets reused is basically free. If compression breaks cache continuity you might save tokens while spending more money.

The deeper issue is that most MCP tools do SELECT * when they should return summaries with drill-down. That's a protocol design problem, not a compression problem.

qeternity 10 hours ago [-]
> With prompt caching, verbose context that gets reused is basically free.

But it's not. It might be discounted cost-wise, however it will still degrade attention and make generation slower/more computationally expensive even if you have a long prefix you can reuse during prefill.

lmeyerov 15 hours ago [-]
We do a fun variant of this for louie.ai when working with database and especially log systems -- think incident response, SRE, devops, outage investigations: instead of returning DB query results to the LLM, we create dataframes (think in-memory parquet). These directly go into responses with token-optimized summary views, including hints like "... + 1M rows", so the LLM doesn't have to drown in logs and can instead decide to drill back into the dataframe more intelligently. Less iterative query pressure on operational systems, faster & cheaper agentic reasoning iterations, and you get a nice notebook back with the interactive data views.
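The "summary view with an elision hint" idea is simple to illustrate without any dataframe library. A sketch of the shape of such a view, not louie.ai's actual implementation; row format and the hint wording are made up:

```python
def summary_view(rows, head=3):
    """Render the first few rows plus a hint about what was elided, so
    the agent can decide to drill back into the full result instead of
    having the whole thing dumped into context."""
    shown = rows[:head]
    hidden = len(rows) - len(shown)
    lines = [" | ".join(map(str, r)) for r in shown]
    if hidden > 0:
        lines.append(f"... + {hidden} more rows (drill in to see them)")
    return "\n".join(lines)

logs = [(i, "GET /health", 200) for i in range(1_000_000)]
print(summary_view(logs).splitlines()[-1])
# -> ... + 999997 more rows (drill in to see them)
```

The hint line is the key design choice: the model knows data exists beyond what it sees, so it issues a targeted follow-up query instead of hallucinating about rows it never read.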

A curious thing about the MCP protocol is it in theory supports alternative content types like binary ones. That has made me curious about shifting much of the data side of the MCP universe from text/json to Apache Arrow, and making agentic harnesses smarter about these just as we're doing in louie.

vishalw007 16 hours ago [-]
As a newbie user that doesn't understand much of this but has claude pro and wants to use it

1. Can this help me? 2. How?

Thanks for sharing and building this.

wener 8 hours ago [-]
Why not use mcp-cli mode, I even made a replica for that https://github.com/wenerme/wode/tree/develop/packages/wener-...
mvkel 1 days ago [-]
Excited to try this. Is this not in effect a kind of "pre-compaction," deciding ahead of time what's relevant? Are there edge cases where it is unaware of, say, a utility function that it coincidentally picks up when it just dumps everything?
mksglu 1 days ago [-]
Yeah it's basically pre-compaction, you're right. The key difference is nothing gets thrown away. The full output sits in a searchable FTS5 index, so if the model realizes it needs some detail it missed in the summary, it can search for it. It's less "decide what's relevant upfront" and more "give me the summary now, let me come back for specifics later."
WesBrownSQL 14 hours ago [-]
I've been running https://github.com/rtk-ai/rtk for a week seems to be a good balance between culling out of context and not just killing everything. I've been running https://github.com/Opencode-DCP/opencode-dynamic-context-pru... in opencode as well. It seems more aggressive.
jnwatson 7 hours ago [-]
Talk about timely. I was just experimenting with a data provider's new MCP server, and I was able to use up my entire Claude Max token limit in under a minute.
theusus 11 hours ago [-]
How is this different than RAG?
unxmaal 1 days ago [-]
I did this accidentally while porting Go to IRIX: https://github.com/unxmaal/mogrix/blob/main/tools/knowledge-...
mksglu 1 days ago [-]
Nice approach. Same core idea as context-mode but specialized for your build domain. You're using SQLite as a structured knowledge cache over YAML rule files with keyword lookup. Context-mode does something similar but domain-agnostic, using FTS5 with BM25 ranking so any tool output becomes searchable without needing predefined schemas. Cool to see the pattern emerge independently from a completely different use case.
ZeroGravitas 1 days ago [-]
I've seen a few projects like this. Shouldn't they in theory make the llms "smarter" by not polluting the context? Have any benchmarks shown this effect?
mksglu 1 days ago [-]
That's the theory and it does hold up in practice. When context is 70% raw logs and snapshots, the model starts losing track of the actual task. We haven't run formal benchmarks on answer quality yet, mostly focused on measuring token savings. But anecdotally the biggest win is sessions lasting longer before compaction kicks in, which means the model keeps its full conversation history and makes fewer mistakes from lost context.
overfeed 18 hours ago [-]
> When context is 70% raw logs and snapshots, the model starts losing track of the actual task

Which frontier model will (re)introduce the radical idea of separating data from executable instructions?

agrippanux 1 days ago [-]
I am a happy user of this and have recommended my team also install it. It’s made a sizable reduction in my token use.
mksglu 1 days ago [-]
Thanks, really appreciate hearing that! Glad it's working well for your team.
tomhow 22 hours ago [-]
HN Mod here. Is the date on the post an error? It says Feb 2025 but the project seems new. I initially went to put a date reference on the HN title but then realised it's more likely a mistake on your post.
doctorpangloss 4 hours ago [-]
His post, code and all the replies here are LLM authored and don't make any sense. He has no idea why his Claude Code instance wrote Feb 2025 instead of Feb 2026. I mean all his results are placebos or nonsense. I can also start new conversations with only 2% of the context in it, or you can call compact, it will all work better. The post has to be flagged.
killingtime74 21 hours ago [-]
Thanks for this. I do most of my work in subagents for better parallelization. Is it possible to have it work there? Currently the stats say subagents didn't benefit from it.
dave_meshimize 11 hours ago [-]
Would be interested to know if this architecture facilitates dynamic context injection from external knowledge sources without inflating the payload again.
esafak 1 days ago [-]
If this breaks the cache it is penny wise, pound foolish; cached full queries have more information and are cheap. The article does not mention caching; does anyone know?

I just enable fat MCP servers as needed, and try to use skills instead.

mksglu 1 days ago [-]
It doesn't break the cache. The raw data never enters the conversation history, so there's nothing to invalidate. A short summary goes into context instead of the full payload, and the model can search the full data from a local FTS5 index if it needs specifics later. Cache stays intact because you're just appending smaller messages to the conversation.
clouedoc 14 hours ago [-]
On here: https://cc-context-mode.mksg.lu/#/3/0/3

> Bun auto-detected for 3–5x faster JS/TS execution

This is quite a claim, and even so, doesn't matter since the bottleneck is the LLM and not the JS interpreter. It's a nit, but little things like this just make the project look bad overall. It feels like nobody took the time to read the copy before publishing it.

More importantly, the claimed 98% context savings are noise without benchmarks of harness performance with and without "context mode".

I'm glad someone is working on this, but I just feel like this is not a serious solution to the problem.

BeetleB 18 hours ago [-]
> With 81+ tools active,

I see your problem.

monkpit 18 hours ago [-]
“you’re holding it wrong” - ok, or we could make it better
afro88 16 hours ago [-]
Sometimes people are actually holding it wrong though
agentifysh 16 hours ago [-]
interesting... this should work with codex too, right?
jamiecode 13 hours ago [-]
[dead]
SignalStackDev 1 days ago [-]
[dead]
aplomb1026 1 days ago [-]
[dead]
formvoltron 1 days ago [-]
[flagged]
jamiecode 2 days ago [-]
[dead]
mksglu 2 days ago [-]
No magic — standard Unix process inheritance. Each execute() spawns a child process via Node's child_process.spawn() with a curated env built by #buildSafeEnv (https://github.com/mksglu/claude-context-mode/blob/main/cont...). It passes through an explicit allowlist of auth vars (GH_TOKEN, AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS, KUBECONFIG, etc.) plus HOME and XDG paths so CLI tools find their config files on disk. No state persists between calls — each subprocess inherits credentials from the MCP server's environment, runs, and exits. This works because tools like gh and aws resolve auth on every invocation anyway (env vars or ~/.config files). The tradeoff is intentional: allowlist over full process.env so the sandbox doesn't leak unrelated vars.
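For readers who don't want to dig into the Node source, here is a Python analog of the allowlist idea described above. The variable list is illustrative (taken from the comment, not exhaustive), and `build_safe_env`/`execute` are stand-in names for the real `#buildSafeEnv`/`execute()`:

```python
import os
import subprocess

# Explicit allowlist: auth vars plus HOME/XDG paths so CLI tools can
# find their config files. Anything else in the parent env is dropped.
SAFE_VARS = [
    "HOME", "PATH", "XDG_CONFIG_HOME",
    "GH_TOKEN", "AWS_ACCESS_KEY_ID",
    "GOOGLE_APPLICATION_CREDENTIALS", "KUBECONFIG",
]

def build_safe_env():
    """Copy only allowlisted vars so unrelated secrets don't leak."""
    return {k: os.environ[k] for k in SAFE_VARS if k in os.environ}

def execute(cmd):
    """Spawn a fresh child per call; no state persists between calls."""
    return subprocess.run(cmd, env=build_safe_env(),
                          capture_output=True, text=True)

os.environ["SECRET_TOKEN"] = "leak-me"  # present in parent, not allowlisted
result = execute(["sh", "-c", "echo ${SECRET_TOKEN:-unset}"])
print(result.stdout.strip())  # -> unset
```

The last two lines show the tradeoff paying off: a variable set in the parent process is simply invisible to the sandboxed child.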
poly2it 1 days ago [-]
Two LLMs speaking with each other on HN? Amusing!
tyre 1 days ago [-]
Why are you assuming they’re an LLM? And please don’t say “em dash”.

Note: you’re replying to the library’s author.

dematz 1 days ago [-]
1st comment: 2 day old account, "is the real story here", summary -> comment -> question, general punchiness of style without saying that much. These llms feel like someone said "be an informal hacker news commenter" so they often end with "Curious how" instead of "I'm curious how" or "Worth building" instead of "It's worth building". Not that humans don't do any of this but all of it together in their comment history, you just get a general vibe.

author reply: not as obvious, but for one thing yes literally em dash, their post has 10 em dashes in 748 words, this comment has 2 em dashes in 115 words. Not that em dash = ai, but in the context of a post about AI it seems more likely. And finally, https://github.com/mksglu/claude-context-mode/blob/main/cont... the file the author linked in their own repo does not exist!

(https://github.com/mksglu/claude-context-mode/blob/main/src/... exists but they messed up the link?)

polski-g 1 days ago [-]
The first two sentences of the first two paragraphs of OP are a dead giveaway.
NamlchakKhandro 21 hours ago [-]
are people still injecting mcp into their context? lmao.

Use skills and cli instead.

medi8r 17 hours ago [-]
Does the skill run in a subagent, saving context?