The M×N problem of tool calling and open-source models (thetypicalset.com)
zbentley 1 days ago [-]
The key part of the article is that token structure interpretation is a training-time concern, not just an input/output processing concern (which still leads to plenty of inconsistency and fragmentation on its own!). That means two things. First, training stakeholders at model development shops need to be closely incorporated into the tool/syntax development process, which leads to friction and slowdowns. Second, any current improvements/standardizations in the way we do structured LLM I/O will necessarily be adopted on the training side after a months-to-years lag, given the time it takes to do new-model dev and training.

That makes for a pretty thorny mess ... and that's before we get into disincentives for standardization (standardization risks big AI labs' moat/lockin).

evelant 1 days ago [-]
I guess I fail to see why this is such a problem. Yes, it would be nice if the wire format were standardized or had a standard schema description, but is writing a parser that handles several formats actually a difficult problem? Modern models could probably whip up a "libToolCallParser" with bindings for all popular languages in an afternoon. You could probably also have an automated workflow for adding any new formats with minimal fuss. An annoyance, yes, but it does not seem like a really "hard" problem. It seems more like a social problem that open source hasn't coalesced around a library that handles it easily yet. Or am I missing something?
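A "libToolCallParser" along those lines could plausibly be a table of per-format extractors that all normalize to one shape. A minimal sketch in Python (the two format patterns below are illustrative stand-ins, not any model's exact syntax):

```python
import json
import re

# Hypothetical multi-format dispatcher: each model family registers a regex
# that locates its tool-call payload, and all payloads get normalized into
# one {"name": ..., "arguments": ...} shape. Patterns are illustrative only.
FORMATS = {
    # e.g. <tool_call>{"name": ..., "arguments": {...}}</tool_call>
    "hermes_style": re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.S),
    # e.g. [TOOL_CALLS] [{"name": ..., "arguments": {...}}]
    "mistral_style": re.compile(r"\[TOOL_CALLS\]\s*(\[.*\])", re.S),
}

def parse_tool_calls(text: str) -> list[dict]:
    """Return normalized tool calls found in raw model output, whatever the format."""
    for _, pattern in FORMATS.items():
        match = pattern.search(text)
        if not match:
            continue
        payload = json.loads(match.group(1))
        calls = payload if isinstance(payload, list) else [payload]
        return [{"name": c["name"], "arguments": c["arguments"]} for c in calls]
    return []
```

The library shape is trivial; the grind is exactly what the article describes: tracking every new model's quirks and keeping the table current.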
HarHarVeryFunny 1 days ago [-]
There already exist products like LiteLLM that adapt tool calling to different providers. FWIW, incompatibility isn't just an opensource problem - OpenAI and Anthropic also use different syntax for tool registration and invocation.

I would guess that lack of standardization of what tools are provided by different agents is as much of a problem as the differences in syntax, since the ideal case would be for a model to be trained end-to-end for use with a specific agent and set of tools, as I believe Anthropic do. Any agent interacting with a model that wasn't specifically trained to work with that agent/toolset is going to be at a disadvantage.

jeremyjh 1 days ago [-]
Presumably the hosting services are resolving all of this in their OpenAI/Anthropic compatibility layer, which is what most tools are using. So this is really just a problem for local engines that have to do the same thing but are expected to work right away for every new model drop.
remilouf 1 days ago [-]
Author here. You're right, it's not a hard problem, but a particularly annoying one.
giantrobot 1 days ago [-]
Maybe they could vibe code some sort of, I don't know, a Web Service Description Language. That could describe how to interact with a service.
Leon8090 1 days ago [-]
[dead]
airstrike 1 days ago [-]
One of the most relevant posts about AI on HN this year. It's not hype-y, but it's imperative to discuss.

I find it strange that the industry hasn't converged on at least a somewhat standardized format, but I guess despite all the progress we're still in the very early days...

gertlabs 22 hours ago [-]
In our benchmarks we exclusively use a custom harness for measuring tool capability. It has common tools that any harness would have, like a thin wrapper around shell commands, basic file editors, etc. but an important part of agentic intelligence is adapting to new tools. Frontier models are already quite adaptable, especially Anthropic models, and improving with each release. I think a standardized format will become less and less important over time.

Benchmarks at https://gertlabs.com

imtringued 10 hours ago [-]
This is backwards. If you think the models are capable of adapting to any format, they will have an easier time adapting to more popular and more common formats until they will eventually become de-facto standards.

The only case where a standard wouldn't win is the case where models are only capable of supporting the baked in format but even this could be solved by adopting a standard format.

HarHarVeryFunny 1 days ago [-]
It's not that strange - the industry wants customer lock-in, not commodification.
kami23 1 days ago [-]
Sounds like we need another standard. /s

This is one of the first tech waves where I feel like I'm on the very very groundfloor for a lot of exploration and it only feels like people have been paying closer attention in the last year. I can't imagine too many 'standard' standards becoming a standard that quickly.

It's new enough that Google seems to be throwing pasta against the wall and seeing what products and protocols stick. Antigravity for example seems too early to me, I think they just came out with another type of orchestrator, but the whole field seems to be exploring at the same time.

Everyone and their uncle is making an orchestrator now! I take a very cautious approach lately where I haven't been loading up my tools (agents, IDEs, browsers, phones) with too much extra stuff, because as soon as I switch something, or something new comes out that doesn't support something I built a workflow around, the tool either becomes inaccessible to me or now has a bigger learning curve than I have the patience for.

I've been a big proponent of trying to get all these things working locally for myself (I need to bite the bullet on some beefy video cards finally), and even just getting tool calls to work with some qwen models has been so counterintuitive.

jrochkind1 1 days ago [-]
Depending on a vendors market position, they may not want to make it easy to switch, which is what standards do, no?
jonathanhefner 1 days ago [-]
Does anyone know why there hasn’t been more widespread adoption of OpenAI’s Harmony format? Or will it just take another model generation to see adoption?
refulgentis 24 hours ago [-]
It's a good question; opinionated* answer: it's the wackiest one by far. I'm not sure it's actually good in the long run. It's very much more intense than the other formats, and idk how to describe this, but I think it puts the model in a weird place where it has to think in this odd framework of channels, and the channel names also shade how it thinks about what it's doing.

It's less of a problem than I'm making it sound, obviously the GPTs are doing just fine. But the counterexample of not having such a complex and unique format and still having things like parallel tool calls has also played out just fine.

When I think on it, the incremental step that made the more classical formats work might have been shifting towards the model having tokens like <parameter=oldText>...</parameter><parameter=newText>...</parameter>. That helped a ton, because you could shift to JSON-ifying stuff inside the parameters instead of having the LLM do it.

Also fwiw, the lore on Harmony was that Microsoft pushed it on them to avoid issues with 2023 Bing and prompt injection and such. The MS VP for Bing claimed this, so not sure how true it is - not that he's unreliable, he's an awesome guy, just, language is loose. Maybe he meant "concept of channels" and not Harmony in toto. Pointing it out because it may be an indicator it was rushed and over-designed, which would explain its relative complexity compared to ~anyone else's.

* I hate talking about myself, but hate it less than being verbose and free-associating without some justification of relevant knowledge: quit Google in late 2022 to build a Flutter all-platform LLM client, based on llama.cpp / any 3rd party provider you can think of. Had to write Harmony parsing twice, as well as any other important local model format you can think of.

Witty0Gore 1 days ago [-]
Useful article; I was fighting with GLM's tool calling format just last night. Stripping and sanitizing it to make it consistently compatible with my UI has been... fun.
anerli 17 hours ago [-]
In my experience it's actually very doable to do reliable tool calling with a generic response format across models. You just need to disable native tool calling completely and provide a clearly defined response/tool format that conforms well to pretraining across a variety of models (e.g. XML-like syntaxes).

For example:

```
<think>Let me take a look at that</think>
<read path="foo.txt"/>
```

The hard part is building a streaming XML parser that handles these responses robustly, can adjust for edge cases, and normalizes predictable mishaps in history in order to ensure continued response format adherence.
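A minimal sketch of that streaming idea in Python, assuming self-closing tags like the `<read/>` example above (a robust parser would also need nested tags, multi-chunk attribute values, and malformed-output recovery):

```python
import re

# Buffer streamed chunks and emit a tool call as soon as a complete
# self-closing tag appears. Tag and attribute names are illustrative.
TAG = re.compile(r'<(?P<name>\w+)(?P<attrs>[^<>]*)/>')
ATTR = re.compile(r'(\w+)="([^"]*)"')

class StreamingToolParser:
    def __init__(self):
        self.buffer = ""

    def feed(self, chunk: str) -> list[dict]:
        """Consume one streamed chunk; return any tool calls it completed."""
        self.buffer += chunk
        calls = []
        for m in TAG.finditer(self.buffer):
            calls.append({"tool": m.group("name"),
                          "args": dict(ATTR.findall(m.group("attrs")))})
        if calls:
            # Discard everything up to the last completed tag so partial
            # trailing tags survive into the next feed() call.
            self.buffer = self.buffer[m.end():]
        return calls
```

Note that a tag split across two chunks simply stays in the buffer until the closing `/>` arrives, which is the essence of the streaming requirement.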

all2 1 days ago [-]
I wonder if stuffing tool call formatting into an engram layer (see Deepseek's engram paper) that could be swapped at runtime would be a useful solution here.

The idea would be to encode tool calling semantics once on a single layer, and inject as-needed. Harness providers could then give users their bespoke tool calling layer that is injected at model load-time.

Dunno, seems like it might work. I think most open source models can have an engram layer injected (some testing would be required to see where the layer best fits).

nasreddin 20 hours ago [-]
The engram idea is actually technically clever, but imo it approaches the solution bottom-up, while Louf's real argument is a top-down view. His solution (declarative specs) solves that by centralizing the spec, making it versioned and composable, independent of any actual model.

Engram layers just move the coordination problem earlier and lock it in. Coordination problems between models & providers would still exist, requiring a layer injection in each open source model and another variant produced for each. Users would still need to choose between "Qwen-8b" and "Qwen-8b-engram" across model families and sizes. Is that cleaner?

all2 6 hours ago [-]
Fair point. I don't know if it is cleaner or not.

The issue with a top-level spec, that I can see, is that models fall back to their training when it comes to tools. This is why I recommended the engram approach, because as far as I can tell the problem is a model problem not a systems problem.

R00mi 1 days ago [-]
MCP is the wire format between agent and tool, not the format the model itself uses to emit the call. That part (Harmony, JSON, XML-ish) is still model-specific. So the M×N the article describes is really two problems stacked — MCP only solves the lower half.

Also in practice Claude Code, Cursor and Codex handle the same MCP tool differently — required params, tool descriptions, response truncation. So MCP gives you the contract but the client UX still leaks.

hedgehog 1 days ago [-]
But, like pancakes, usually the stack is described as building bottom-up. Can you relate the individual components to ingredients in a diner-style pancake breakfast?
R00mi 5 hours ago [-]
Bottom-up preparation :)

- Plate — your product

- Pancake 1 — the model

- Syrup #1 — how the model emits the tool call (Harmony, JSON, XML-ish). Different flavor at every table.

- Pancake 2 — the client/agent

- Syrup #2 — MCP. Same brand at every table, finally.

- Pancake 3 — the MCP server (the tool)

Two syrups. MCP standardized the top one. The bottom one is still BYO at every table. That's the article. What do you think about that?

HarHarVeryFunny 23 hours ago [-]
Bottom line is that MCP doesn't change anything in the way the model discovers and invokes tools, so MCP doesn't help with the issue of lack of standard tool call syntax.

1) The way basic non-MCP tool use works is that the client (e.g. agent) registers (advertises) the tools it wants to make available to the model by sending an appropriate chunk of JSON to the model as part of every request (since the model is stateless), and if the model wants to use the tool then it'll generate a corresponding tool call chunk of JSON in the output.

2) For built-in tools like web_search the actual implementation of the tool will be done server-side before the response is sent back to the client. The server sees the tool invocation JSON in the response, calls the tool and replaces the tool call JSON with the tool output before sending the updated response back to the client.

3) For non-built-in tools such as the edit tool provided by a coding agent, the tool invocation JSON will not be intercepted server-side, and is instead just returned as-is to the client (agent) as part of the response. The client now has the responsibility of recognizing these tool invocations and replacing the invocation JSON with the tool output, the same as the server would have done for built-in tools. The actual "tool call" can be implemented by the client however it likes - either internally within the client or by calling some external API.

4) MCP tools work exactly the same as other client-provided tools, aside from how the client learns about them, and implements them if the model chooses to use them. This all happens client side, with the server/model unaware that these client tools are different from any others it is offering. The same JSON tool registration and JSON tool call syntax will be used.

What happens is that client configuration tells it what MCP servers to support, and as part of client initialization the client calls each MCP server to ask what tools it is providing. The client then advertises/registers these MCP tools it has "discovered" to the model in the normal way. When the client receives a tool call in the model response and sees that it is an MCP provided tool, then it knows it has to make an MCP call to the MCP server to execute the tool call.
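The loop in (1)-(4) can be sketched roughly as follows; all names and response shapes here are illustrative, not any specific provider's API:

```python
# Hedged sketch of the client-side agent loop: register tools with every
# (stateless) request, execute any tool call the model emits, and feed the
# result back in on the next request. MCP-discovered tools dispatch the same
# way, just implemented via a call out to the MCP server.

def run_agent_turn(send_to_model, local_tools, messages, tool_schemas):
    """send_to_model: callable standing in for one stateless model request."""
    while True:
        response = send_to_model(messages, tools=tool_schemas)
        if response.get("tool_call") is None:
            return response["text"]  # plain answer: the turn is done
        call = response["tool_call"]
        result = local_tools[call["name"]](**call["arguments"])  # run client-side
        # Append both the call and its result so the next request carries
        # full context (the model itself remembers nothing between requests).
        messages.append({"role": "assistant", "tool_call": call})
        messages.append({"role": "tool", "content": result})
```

Nothing in this loop is MCP-specific, which is the point: MCP changes where `local_tools` comes from, not what the model sees.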

TL/DR

o the client/agent talks standard MCP protocol to the MCP servers

o the client/agent talks model-specific tool use protocol to the model

logotype 23 hours ago [-]
There’s a lot of wasted compute with stateless inference. We set out to solve that with a new computational model for transformers, and only process the delta between requests. That’s how we can achieve crazy low latency tool calling with LayerScale. Check it out https://layerscale.ai and technical whitepapers out next month!
hashmap 1 days ago [-]
The native way to skip all that is train a small thingy to map hidden state -> token/thingy you care about once per model family, or just do it once and procrustes over the state from the model you're using to whatever you made the map for.
alienbaby 1 days ago [-]
In Greek mythology, Procrustes (/proʊˈkrʌstiːz/; Greek: Προκρούστης Prokroustes, "the stretcher [who hammers out the metal]"), also known as Prokoptas, Damastes (Δαμαστής, "subduer") or Polypemon, was a rogue smith and bandit from Attica who attacked people by stretching them or cutting off their legs, so as to force them to fit the size of an iron bed.

I can't figure out if you meant that or not, it kinda fits. (No pun intended)

hashmap 1 days ago [-]
well yes and no, i meant https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem which, yes, is named for that stretchster
barumrho 19 hours ago [-]
Is there a reason why this has to be done at training time? Could the system prompt tell the model to convert the output to a different format?
kleton 1 days ago [-]
Don't inference servers like vllm or sglang just translate these things to openai-compat API shapes?
ethan_smith 1 days ago [-]
They do, but that's kind of the article's point - someone still has to write and maintain the per-model chat template and tool call parsing inside vllm/sglang. Every time a new model ships with a slightly different format, the inference server needs an update. The M×N problem doesn't disappear, it just gets pushed one layer down.
zhangchen 19 hours ago [-]
[dead]
Nevermark 1 days ago [-]
Feedback: I don't usually comment on formatting, but that fat indent is too much. I applied "hide distracting items" to the graphic, and the indent is still there. Reader works.
0xnadr 1 days ago [-]
This is a real problem. The function calling format fragmentation across models makes it painful to build anything provider-agnostic.
seamossfet 1 days ago [-]
Great article, but your site background had me trying to clean my laptop screen thinking I splashed coffee on it.
remilouf 1 days ago [-]
Ooops sorry
ikidd 1 days ago [-]
This sounds like a problem that LLMs were built to solve.
Havoc 1 days ago [-]
Not fast enough and increases attack surface
ontouchstart 1 days ago [-]
goodmythical 1 days ago [-]
Clicking that directly yields: "hi orange site user, i'd prefer my stuff to stay off the radar of this particular community."
ontouchstart 1 days ago [-]
Thanks. This is so hilarious ;-)

https://mariozechner.at/nothanks.html

I didn't see it on mobile. So it only happened to desktop browser.

I only found out via pi myself:

> pi --continue -p "Check the link and see if there is a banner to turn back users from HN community"

Goodmythical’s comment was *accurate at the time it was written* – the link did trigger the “no‑thanks” page when it was opened from Hacker News. The “banner” is not a visual element that lives on the main article page; it is the content of the separate *`/nothanks.html`* file that the site redirects to.

When the redirect was in place, the user experience was:

1. User clicks the link while still on `news.ycombinator.com`.

2. The script in `components.js` sees the referrer and redirects the browser to `/nothanks.html`.

3. The `/nothanks.html` page displays the single line “hi orange site user …” – this is what Goodmythical described as the banner.

If you now visit the same link directly (e.g., from a bookmark or a search engine) the redirect is bypassed and you see the normal article, so you won’t see that page at all.

casey2 13 hours ago [-]
Ironically LLMs solve the MxN problem he's complaining about. He wants to get rid of the problem entirely, but fails to see the value of pointless differences.

It's the same kind of hubris that asks why we don't all speak one language. In the future we will all speak one language, and we will all also speak either our own or a DSL shared by only a few others; in America we will all speak English, in Japan even the tourists will all speak Japanese. Very few will know English, but some will know it better than anyone.

remilouf 12 hours ago [-]
> Ironically LLMs solve the MxN problem he's complaining about

Enlighten me please

jiehong 1 days ago [-]
Am I misunderstanding, or isn't this supposed to be the point of MCP?
akoumjian 1 days ago [-]
The models only output text. Tool calls are nothing more than specially formatted text which gets parsed and interpreted by the inference server (or some other driver) into something which can be picked up by your agent loop and executed. Models are trained on a wide variety of different delimiters and escape characters to indicate their tool calls (along with things like separate thinking blocks). MCP is mostly a standard way to share with your agent loop the list of tool names and what their arguments are, which then gets passed to the inference server, which then renders it down to text to feed to the model.
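That "renders it down to text" step is where the fragmentation lives: the same abstract tool schema becomes different literal prompt text per model family. A toy illustration (both templates are simplified stand-ins, not the exact strings any real model was trained on):

```python
import json

# One abstract tool schema, roughly what MCP hands the agent loop.
tool = {"name": "read_file",
        "parameters": {"path": {"type": "string"}}}

# Two hypothetical per-model renderings of that same schema into the literal
# text the model actually sees; real chat templates differ in exactly this way.
def render_hermes_style(t):
    return f"<tools>{json.dumps(t)}</tools>"

def render_mistral_style(t):
    return f"[AVAILABLE_TOOLS] {json.dumps([t])} [/AVAILABLE_TOOLS]"
```

Since each model was trained on one such rendering, the inference server has to carry a template like this for every model family it serves.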
perlgeek 1 days ago [-]
> Tool calls are nothing more than specially formatted text which gets parsed and interpreted by the inference server

I know this is getting off-topic, but is anybody working on more direct tool calling?

LLMs are based on neural networks, so one could create an interface where activating certain neurons triggers tool calls, with other neurons encoding the inputs; another set of neurons could be triggered by the tokenized result from the tool call.

Currently, the lack of separation between data and metadata is a security nightmare, which enables prompt injection. And yet all I've seen done about it are workarounds.

yorwba 1 days ago [-]
Each text token already represents the activation of certain neurons. There is nothing "more direct." And you cannot fully separate data and metadata if you want them to influence the output. At best you can clearly distinguish them and hope that this is enough for the model to learn to treat them differently.
perlgeek 1 days ago [-]
Are there tokens reserved for tool calls? If yes, I can see the equivalence. If not, not so much.
yorwba 1 days ago [-]
Yes, typically the tags used for tool calls get their own special tokens, e.g. https://huggingface.co/google/gemma-4-E4B-it/blob/main/token...
dontlikeyoueith 1 days ago [-]
> LLMs are based on neural networks, so one could create an interface where activating certain neurons triggers tool calls, with other neurons encoding the inputs; another set of neurons could be triggered by the tokenized result from the tool call.

You can do this. It's just sticking a different classifier head on top of the model.

Before foundation models it was a standard Deep RL approach. It probably still is within that space (I haven't kept up on the research).

You don't hear about it here because if you do that then every use case needs a custom classifier head which needs to be trained on data for that use case. It negates the "single model you can use for lots of things" benefit of LLMs.
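As a toy illustration of the classifier-head idea: a separate linear layer reads the model's final hidden state and scores a fixed set of tools directly, instead of the model emitting tool-call text. The weights and tool list here are made up; a real head would be trained on use-case data, which is exactly the drawback noted above.

```python
# Hypothetical tool vocabulary and hand-picked weights (hidden size 4 for
# the sketch); in practice W comes from training a head per use case.
TOOLS = ["none", "web_search", "edit_file"]

W = [
    [0.1, -0.2, 0.0, 0.3],   # scores "none"
    [0.9,  0.1, 0.4, -0.5],  # scores "web_search"
    [-0.3, 0.8, 0.2, 0.1],   # scores "edit_file"
]

def pick_tool(hidden_state):
    """Linear head: dot each weight row with the hidden state, take the argmax."""
    scores = [sum(w * h for w, h in zip(row, hidden_state)) for row in W]
    return TOOLS[scores.index(max(scores))]
```

The upside is that there is no text format to parse at all; the downside is the one stated above: the head (and its tool vocabulary) is frozen per deployment rather than described in the prompt.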

zbentley 1 days ago [-]
I'm a novice in this area, but my understanding is that LLM parameters ("neurons", roughly?), when processed, encode a probability for token selection/generation that is much more complex and many:one than "parameter A is used in layer B, therefore suggest token C", and not a specific "if activated then do X" outcome. Given that, how would this work?
agent-kay 1 days ago [-]
[dead]
jeremie_strand 1 days ago [-]
[dead]
kantaro 21 hours ago [-]
[dead]