All your agents are going async (zknill.io)

127 points by zknill 3 days ago | 77 comments

edg5000 1 days ago [-]

There is nothing wrong with the HTTP layer, it's just a way to get a string into the model.

The problem is the industry obsession with concatenating messages into a conversation stream. There is no reason to do it this way. Every time you run inference on the model, the client gets to compose the context in any way they want; there are more things than just concatenating prompts and LLM outputs. (A drawback is that caching won't help much if most of the context window is composed dynamically.)

Coding CLIs as well as web chat work well because the agent can pull information into the session at will (read a file, web search). The pain point is that if you're appending messages to a stream, you're just slowly filling up the context.

The fix is to keep the message stream concept for informal communication with the prompter, but have an external, persistent message system that the agent can interact with (a bit like email). The agent can decide which messages they want to pull into the context, and which ones are no longer relevant.

The key is to give the agent not just the ability to pull things into context, but also remove from it. That gives you the eternal context needed for permanent, daemonized agents.

vanviegen 23 hours ago [-]

I've been working on a coding agent that does this on and off for about a year. Here's my latest attempt: https://github.com/vanviegen/maca#maca - This one allows agents to request (and later on drop) 'views' on functions and other logical pieces of code, and always get to see the latest version of it. (With some heuristics to not destroy kv-caches at every turn.)

The problem is that the models are not trained for this, nor for any other non-standard agentic approach. It's like fighting their 'instincts' at every step, and the results I've been getting were not great.

mncharity 17 hours ago [-]

> allows agents to request (and later on drop) 'views' on functions and other logical pieces of code [...] The problem is that the models are not trained for this

Fwiw, I was playing with an "outliner"-tool collapse/expand idiom, on synthetic literate-programming markdown files, with #ids on headers and blocks. Insufficient experience to suggest it works, but it wasn't obviously not working, and that with a non-frontier model and very little guidance. Other familiar related idioms include <details>/<summary>, hierarchical breadcrumbs, and plan9-ish synthetic filesystems `foo.c/f.{c,dataflow,etc}`. One open question was comfort with more complex visibility transformations or sets - "hide #bar; show 2 levels of headers-only under #hee; ...". Another was cleanup - recognition of "I no longer need this and that".

edg5000 20 hours ago [-]

So we agree on a message system having potential. But why the vectors? In any case, interesting stuff.

vanviegen 16 hours ago [-]

I'm using vector embeddings for creating code views based on semantic search, initially based on the user prompt. That really works wonders to give the agent a flying start.

zknill 24 hours ago [-]

> "and which ones are no longer relevant."

This is absolutely the hardest bit.

I guess the short-cut is to include all the chat conversation history, and then if the history contains "do X" followed by "no actually do Y instead", then the LLM can figure that out. But isn't it fairly tricky for the agent harness to figure that out, to work out relevancy, and to work out what context to keep? Perhaps this is why the industry defaults to concatenating messages into a conversation stream?

edg5000 4 hours ago [-]

My guess (I will test this eventually) is that you set a window size (which may be the model limit, or lower to reduce input token costs), the harness then refuses to show items that don't fit. If the model emits a command to read a file, the harness then says "File hidden due to lack of context space". In the system prompt, the model is informed about the context space usage, and that it can hide files. It needs to be instructed that if files contain something noteworthy, that the agent notes this down in their notes, which should always be rendered into the context. If this fails, the agent will hide a file with relevant information and then get lost in circles. If it succeeds, the agent can work on larger tasks autonomously. So it's worth trying.
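A minimal sketch of the budgeted-context idea described above, in Python. Everything here is hypothetical: the `ContextWindow` class, the `show`/`hide` method names, and the 4-characters-per-token estimate are stand-ins, not anything the commenter has built. It only illustrates the three rules from the comment: a fixed window size, a refusal message when an item does not fit, and notes that are always rendered.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


@dataclass
class ContextWindow:
    budget: int                                        # window size chosen by the harness
    notes: list = field(default_factory=list)          # always rendered into the context
    open_items: dict = field(default_factory=dict)     # item name -> content

    def used(self) -> int:
        texts = self.notes + list(self.open_items.values())
        return sum(estimate_tokens(t) for t in texts)

    def add_note(self, note: str) -> None:
        self.notes.append(note)

    def show(self, name: str, content: str) -> str:
        # Refuse to render items that do not fit in the remaining budget.
        if self.used() + estimate_tokens(content) > self.budget:
            return f"{name}: File hidden due to lack of context space"
        self.open_items[name] = content
        return f"{name}: opened"

    def hide(self, name: str) -> str:
        # The agent frees space by hiding an item it no longer needs.
        self.open_items.pop(name, None)
        return f"{name}: hidden"

    def render(self) -> str:
        # Usage line first, so the model always sees how full the window is.
        parts = [f"[context {self.used()}/{self.budget} tokens]"]
        parts += [f"NOTE: {n}" for n in self.notes]
        parts += [f"--- {name} ---\n{body}" for name, body in self.open_items.items()]
        return "\n".join(parts)
```

The failure mode the comment worries about maps onto `hide` being called before `add_note`: the content is gone and nothing noteworthy was written down.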

asixicle 23 hours ago [-]

That's what the embedding model is for. It's like a tack-on LLM that works out the relevancy and context to grab.

nprateem 22 hours ago [-]

God knows why you think this is possible. If I don't even know what might be relevant to the conversation in several turns, there's no way an agent could either.

asixicle 22 hours ago [-]

One of us is confusing prediction with retrieval. The embedding model doesn't predict what is going to be relevant in several turns, just on the turn at hand. Each turn gets a fresh semantic search against the full body of memory/agent comms. If the conversation or prompt changes the next query surfaces different context automatically.

As you build up a "body of work" it gets better at handling massive, disparate tasks in my admittedly short experience. Been running this for two weeks. Trying to improve it.
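To make the retrieval-versus-prediction point concrete, here is a toy sketch of a per-turn search over stored memory. It is not the persMEM implementation: `embed` is a bag-of-words stand-in for a real embedding model, and `MemoryIndex` is a hypothetical name. The only thing it shows is that each turn issues a fresh query, so a changed prompt surfaces different entries without anyone predicting future relevance.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryIndex:
    def __init__(self) -> None:
        self.entries = []  # list of (text, vector) pairs

    def add(self, text: str) -> None:
        self.entries.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list:
        # Rank every stored entry against the query for the current turn only.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

A harness would call `search` with the current prompt on every turn and place the results in the context before running inference.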

edg5000 4 hours ago [-]

So the embedding model is a fixed-size view on an arbitrarily sized work history (tool calls, natural language messages)? The model is like a summarizer, but in latent space? And not aimed to summarize, but trained to hold whatever is needed for the agent to be autonomous for longer runs?

vdelpuerto 22 hours ago [-]

[flagged]

sourcecodeplz 23 hours ago [-]

Yeah, opencode was/is like this and they never got caching right. Caching is a BIG DEAL to get right.

edg5000 20 hours ago [-]

Now I see why Anthropic isn't too happy with third party clients. The clients may not be so nice to their capacity as their own client, which has the interests aligned with minimum token consumption. A tricky dynamic.

evenhash 11 hours ago [-]

> There is nothing wrong with the HTTP layer, it's just a way to get a string into the model.

I know you don’t mean it in a reductive sense, but it’s funny/sad that I can imagine

“HTTP is just a way to get a string into a model”

becoming a real piece of wisdom unironically dispensed on this site in the future. Maybe it already is.

alehlopeh 21 hours ago [-]

As you noted briefly, a big drawback is not getting to take advantage of the cache. Seems like a pretty big drawback.

edg5000 21 hours ago [-]

Yes, it will destroy most of the caching potential. On the other hand, the average context window needed to achieve the same type of task may be much smaller. This might make up for it. And with a better harness, fewer rounds may be needed. Plus, hopefully costs will go down. There is a lot of hope in this comment though.

raincole 22 hours ago [-]

> The key is to give the agent not just the ability to pull things into context, but also remove from it

Of course Anthropic/OpenAI can do it. And the next day everyone will be complaining about how much Claude/Codex has been dumbed down. They don't even comply with the context anymore!

zozbot234 18 hours ago [-]

> Every time you run inference on the model, the client gets to compose the context in any way they want; there are more things than just concatenating prompts and LLM outputs.

You can always launch a subagent with a fresh context. There are further things that you could do by tweaking the underlying transformer model (such as "joining" any number of independently cached contexts together on an equal basis, without having to rerun prefill on the "later" contexts) but this is quite general already.

zahlman 21 hours ago [-]

> the industry obsession

Or maybe they haven't thought about it?

Or they tried some simple alternatives and didn't find clear benefits?

> The key is to give the agent not just the ability to pull things into context, but also remove from it.

But then you need rules to figure out what to remove. Which probably involves feeding the whole thing to a(nother?) model anyway, to do that fuzzy heuristic judgment of what's important and what's a distraction. And simply removing messages doesn't add any structure, you still just have a sequence of whatever remains.

edg5000 20 hours ago [-]

What I'm thinking is: When the agent wants to open more files or open more messages, eventually there will be no more context left. The agent is then essentially forced to hide some files and messages in order to be able to proceed. Any other commands are refused until the agent makes room in the context. Maybe the best models will be able to handle this responsibility. A bad model will just hide everything and then forget what it was working on.

ljm 19 hours ago [-]

A smalltalk or Erlang for AI agents is an interesting thought. Smalltalk for the design in terms of message passing and object-oriented holding of state (agents are stateful and are reached via their public interfaces), Erlang for the elegant execution of it with actors and mailboxes (agents have inboxes and outboxes and can work concurrently at scale). Might as well go the whole hog and put a supervisor AI agent in as a switchboard.

asixicle 23 hours ago [-]

To be utterly shameless, this is what I've been building: https://github.com/ASIXicle/persMEM

Three persistent Claude instances share AMQ with an additional Memory Index to query with an embedding model (that I'm literally upgrading to Voyage 4 nano as I type). It's working well so far, I have an instance Wren "alive" and functioning very well for 12 days going, swapping in-and-out of context from the MCP without relying on any of Anthropic's tools.

And it's on a cheap LXC, 8GB of RAM, N97.

handfuloflight 22 hours ago [-]

Why is shame a factor at all in sharing your work?

asixicle 22 hours ago [-]

Good point. I guess because I'm new here I'm not positive on the decorum-policy for self-promotion.

I just make stuff to share with others, so yeah, good point.

altruios 18 hours ago [-]

Context is going to be the next big advancement.

When a model is trained on multi-contexts, some growing over time like we see now (conversations), some rolling at various sizes (as in, always on), such as a clock, video feed, audio feed, data streams, tool calling, we no longer have to 'pollute' the main context with a bunch of repetitive data.

But this is going in the direction of 1agent=1mind, when it is much more likely that human cognition (and maybe all cognition) requires 'ghosts' and subprocesses. It is much more likely that an agent is a configurable building block of a(n alien) mind.

ElFitz 24 hours ago [-]

Hmm.

Maybe there’s a way to play around with this idea in pi. I’ll dig into it.

bozdemir 20 hours ago [-]

Solid post but half of this reads like a pitch deck for Ably with extra steps. The disclosure would have landed better at the top than buried two paragraphs from the end. The transport isn't really the problem though. The interaction model is. When my cron agent finishes a task at 3am I don't want a live session I can rejoin, I want it to drop a message in Slack or email me and shut up. The scenarios you list (agent outlives caller, unprompted push, caller changes device) are all solved in prod today with pubsub plus a notification provider, and HTTP handles that just fine. Durable session as a first class primitive is cool, but it's a nice-to-have for a narrow slice of work, not a prerequisite for "all your agents going async." The page-refresh-kills-the-chatbot point is fair, but SSE with Last-Event-ID and proper event sourcing gets you 80 percent of what you're describing without inventing a new primitive. Trading systems and chat apps have been doing this for 15 years. WebSockets with resumable streams aren't some unsolved frontier. Where you do have a real point is multi-human collaborative sessions with token streaming across devices. That genuinely is awkward on pure request-response. But framing it as "the transport is broken" oversells it. Most async agents are fire and forget with a webhook at the end, and they're fine.
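The "SSE with Last-Event-ID and proper event sourcing" point can be sketched as follows. This is an illustration only, with an in-memory log and hypothetical names (`EventLog`, `resume_stream`); a real deployment would persist the log and serve it over HTTP. It assumes each event's data is a single line and that ids are consecutive integers starting at 1.

```python
from typing import Optional


class EventLog:
    """Append-only log of events for one agent run (in memory for this sketch)."""

    def __init__(self) -> None:
        self._events = []

    def append(self, data: str) -> int:
        self._events.append(data)
        return len(self._events)  # ids start at 1

    def replay(self, last_event_id: int = 0) -> list:
        # Everything strictly after the id the client last saw.
        return [(i + 1, d) for i, d in enumerate(self._events) if i + 1 > last_event_id]


def to_sse(event_id: int, data: str) -> str:
    # One SSE frame: an id line, a data line, and a blank line as terminator.
    return f"id: {event_id}\ndata: {data}\n\n"


def resume_stream(log: EventLog, last_event_id_header: Optional[str]) -> str:
    # A browser EventSource sends Last-Event-ID on reconnect; absent means "from the start".
    last = int(last_event_id_header) if last_event_id_header else 0
    return "".join(to_sse(i, d) for i, d in log.replay(last))
```

Because the server writes every event to the log before sending it, a page refresh only costs a replay from the last id the client acknowledged.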

alansaber 18 hours ago [-]

Agree with this, the problem isn't a technical one, it's UX.

_pdp_ 1 days ago [-]

Here is an interesting find.

Let's say that you have two agents running concurrently: A & B. Agent A decides to push a message into the context of agent B. It does that and the message ends up somewhere in the list of messages, right at the bottom of the conversation.

The question is, will agent B register that a new message was inserted and will it act on it?

If you do this experiment you will find out that this architecture does not work very well. New messages that are recent but not the latest have little effect in an interactive session. In other words, Agent B will not respond and say, "and btw, this and that happened" unless perhaps instructed very rigidly or perhaps if there is some other instrumentation in place.

Your mileage may vary depending on the model.

A better architecture is pull-based. In other words, the agent has tools to query any pending messages. That way whatever needs to be communicated is immediately visible as those are right at the bottom of the context so agents can pay attention to them.

An agent in that case is slightly more rigid, in the sense that the loop needs to orchestrate and surface information, and there is certainly no one-size-fits-all solution here.

I hope this helps. We've learned this the hard way.
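A rough sketch of the pull-based design described above, under stated assumptions: `Mailbox`, `check_messages`, and `run_turn` are hypothetical names, and the context is modelled as a plain list of dicts rather than any particular provider's message format. The point it illustrates is that pending messages are drained by a tool call and appended as the newest entry, so they sit at the very bottom of the context.

```python
from collections import defaultdict, deque


class Mailbox:
    def __init__(self) -> None:
        self._pending = defaultdict(deque)  # recipient name -> queued messages

    def send(self, to: str, sender: str, body: str) -> None:
        self._pending[to].append({"from": sender, "body": body})

    def check_messages(self, agent: str) -> list:
        # Exposed to the agent as a tool; drains the queue so each message is seen once.
        box = self._pending[agent]
        drained = list(box)
        box.clear()
        return drained


def run_turn(agent: str, context: list, mailbox: Mailbox) -> list:
    # The loop surfaces pending messages as the latest context entry,
    # instead of another agent inserting them somewhere in the middle.
    pending = mailbox.check_messages(agent)
    if pending:
        lines = [f"{m['from']}: {m['body']}" for m in pending]
        context.append({"role": "tool", "name": "check_messages", "content": "\n".join(lines)})
    return context
```

Whether the loop polls on every turn or the model decides when to call the tool is the orchestration choice the comment says has no single right answer.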

sudosteph 23 hours ago [-]

Yep, I didn't want to have to think about concurrency so my solution was a global lock file on my VM that gets checked by a pre-start hook in Claude Code. Each of my "agents" is its own Linux user with their own CLAUDE.md, and there is a changelog file that gets injected into that each time they launch. They can update the changelog themselves, and one agent in particular runs more frequently to give updates to all of them. Most of it is just initiated by cron jobs. This doesn't scale infinitely, but if you stick to two-pizza teams per VM it will still be able to do a lot.

So hooks are your friends. I also use one as a pre flight status check so it doesn't waste time spinning forever when the API has issues.
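The comment does not say how the lock file is implemented, so the following is only one possible shape for it: an exclusive-create lock that a pre-start script could take before launching an agent. The names `global_lock` and `LockHeld` are made up for this sketch, and how such a script is wired into a Claude Code hook is deliberately not shown.

```python
import os
from contextlib import contextmanager


class LockHeld(Exception):
    """Raised when another agent already holds the lock."""


@contextmanager
def global_lock(path: str):
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise LockHeld(path)
    try:
        # Record the holder's pid so a human can see who owns the lock.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(path)
```

One known weakness of this approach is a stale lock left behind if the holder is killed before cleanup, so a real setup would likely want a timeout or a pid liveness check.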

aledevv 1 days ago [-]

> All of these features are about breaking the coupling between a human sitting at a terminal or chat window and interacting turn-by-turn with the agent.

This means:

- less and less "man-in-the-loop"

- less and less interaction between LLMs and humans

- more and more automation

- more and more decision-making autonomy for agents

- more and more risk (i.e., LLMs' responsibility)

- less and less human responsibility

Problem:

Tasks that require continuous iteration and shared decision-making with humans have two possible options:

- either they stall until human input

- or they decide autonomously at our risk

Unfortunately, automation comes at a cost: RISK.

canarias_mate 16 hours ago [-]

[dead]

dist-epoch 1 days ago [-]

AI driven cars have better risk profiles than humans.

Why do you think the same will not also be true for AI steerers/managers/CEO?

In a year or two, having a human in the loop, with all of their biases and inconsistencies, will be considered risky and irresponsible.

khafra 23 hours ago [-]

"Did the vehicle just crash" has a short feedback loop, very amenable to RL. "Did this product strategy tank our earnings/reputation/compliance/etc" can have a much longer, harder to RL feedback loop.

But maybe not that much longer; METR task length improvement is still straight lines on log graphs.

dist-epoch 23 hours ago [-]

The AI has read all the business books, blogs and stories.

Unless your CEO is Steve Jobs, it's hard to imagine it being much worse than your average pointy haired boss.

rapind 23 hours ago [-]

> The AI has read all the business books, blogs and stories.

This seems like a liability as most business books, blogs, and stories are either marketing BS or gloss over luck and timing.

> Unless your CEO is Steve Jobs, it's hard to imagine it being much worse than your average pointy haired boss.

As someone using AI agents daily, this is actually incredibly easy to imagine. It's actually hard to imagine it NOT being horrible! Maybe that'll change though... if gains don't plateau.

nprateem 22 hours ago [-]

But they are shit. Over the last 2 days I've got bored of the predictable cycle of it first getting excited about a new idea then backpedaling once I shoot it to pieces.

They can't write and think critically at the same time. Then subsequent messages are tainted by their earlier nonsensical statements.

Opus 3.7 BTW, not some toy open source model.

jddj 24 hours ago [-]

Getting to that point is likely going to involve a lot of (the business and personal equivalent of) Teslas electing to drive through white semitrailers.

philipwhiuk 22 hours ago [-]

Or autonomous weapons?

oblio 23 hours ago [-]

> AI driven cars have better risk profiles than humans.

From which company? I hope you say "Waymo", because Tesla is lying through its teeth and hiding crash statistics from regulators.

slfnflctd 19 hours ago [-]

Let's not forget that Waymo requires an extensive, custom mapping and software/pre-training development process for every new city they operate in, are only in 10 cities total after over 20 years, and are still nowhere near profitability (or even with a clear plan to get there as far as I can tell).

I personally believe widely available self-driving cars which don't operate at a loss will continue to elude us until we accept the tradeoffs of dedicated lanes, a standardized vehicle-to-vehicle communication protocol, and roadside sensors. We were lied to.

oblio 18 hours ago [-]

For a fraction of the cost of developing self-driving cars we could have self-driving trains/trams/subways and most likely minibuses as part of public transportation networks.

And self-driving minibuses would basically provide 95% of the benefits of self-driving cars. They could offer 24/7 frequent service with huge coverage, we already have dedicated bus lanes in many places (and we could scale dedicated bus lanes much faster than dedicated self-driving car lanes), etc.

Now, I understand that in many places (especially the US) this is infeasible because public anything = communism.

ChrisLTD 17 hours ago [-]

Folks in the US are happy to spend tax dollars on roads, it's just that mass transit spending is considered communism.

To be fair to the anti-train crowd, we've been led so far down this disastrous path of car-led sprawl that the hope of even building feasible buses that can reach into the byzantine suburbs is unlikely.

So, maybe our best hope is self-driving EVs? At least in our lifetimes.

samoladji 5 hours ago [-]

Good framing. The transport mismatch is real and already causing pain in production. One thing worth adding: the security surface expands significantly when agents go async. When an agent is synchronous, a human is implicitly in the loop on every action. When it's running in the background on a cron or webhook, there's no one watching. The agent can take hundreds of actions before anyone notices something went wrong. The transport problem you're describing is urgent. The governance problem that comes with async agents is equally urgent and almost nobody is talking about it yet.

artisin 24 hours ago [-]

So reinventing terminal multiplexing, except over proprietary chat/realtime transports instead of PTYs?

oblio 23 hours ago [-]

Yeah, but one is free and the other one might make you a billionaire.

If you think about it, about 30% of the biggest businesses out there are based on this exact business idea. IRC - Slack, XMPP & co - the many proprietary messengers out there, etc.

hardsnow 17 hours ago [-]

I’ve been using email as an async channel with agents. Email does proper long-form async and native threaded communication extremely well and IMO is the best match UX-wise.

The system I’ve developed for this is open source and detailed at https://airut.org

Yokohiii 3 days ago [-]

this is a commercial sales pitch for something that doesn't exist

zknill 1 days ago [-]

I don't think this is quite right. I do work for a pub/sub company that's involved in this space, but this article isn't a commercial sales pitch and we do have a product that exists.

The article is about how agents are getting more and more async features, because that's what makes them useful and interesting. And how the standard HTTP based SSE streaming of response tokens is hard to make work when agents are async.

sudb 19 hours ago [-]

If agents are async, is streaming still important? I think the useful set of interactions with an async agent are pretty limited - you'd want to stop, interrupt with a user message, maybe pause, resume, or steer with a user message?

All of those can be done without needing streams or a session abstraction I think, unless I'm misunderstanding.

philipwhiuk 22 hours ago [-]

> but this article isn't a commercial sales pitch

Yes it is. But it's nice you've convinced yourself I guess.

What is this, if not a product pitch:

> Because we’re building on our existing realtime messaging platform, we’re approaching the same problem that Cloudflare and Anthropic are approaching, but we’ve already got a bi-directional, durable, realtime messaging transport, which already supports multi-device and multi-user. We’re building session state and conversation history onto that existing platform to solve both halves of the problem; durable transport and durable state.

tim-projects 21 hours ago [-]

I feel like this is a case of just because you can doesn't mean you should.

I still sit and watch my terminals. It's the easiest way to catch problems.

probabletrain 18 hours ago [-]

> Looking at the OpenClaw model, where the conversation history is in the chat channel and the agent process and LLM provider are both separated from that, you can’t build the same design on Cloudflare or Anthropic

Yes you can - durable objects do exactly what the "Ably pub/sub channel transport" diagram describes. And it's even easier with the cloudflare agents SDK. This article strawmans the capabilities of competing infra.

skybrian 17 hours ago [-]

This already exists. I’m a happy user of exe.dev VMs. They have a coding agent called Shelley (https://exe.dev/shelley) that works fine in a web browser on my laptop, tablet, and phone. I can close my laptop at any time and the agent keeps running in the VM.

It works with multiple LLMs. The main downside is that since they go through the API, it gets expensive once the monthly quota runs out. (They claim to resell additional API usage at cost, but that doesn’t seem easy to verify.) I’ve switched to using Sonnet for most things but haven’t experimented with cheaper models yet.

It seems like the big price difference between what going through the API costs and what you can get via a subscription is really holding things back.

2001zhaozhao 10 hours ago [-]

Easy.

- The agent and all its state stays on a persistent server that saves state on restart

- Just stream the state directly to the client via websockets, or even the entire UI with something like liveview

OpenClaw has already proven this model and I don't see a great reason to try and solve the problem a different way.

sasipi247 18 hours ago [-]

OpenAI Responses API has WebSocket mode, which can be used instead of SSE, which works very well and feels like a leap forward in terms of performance.

https://developers.openai.com/api/docs/guides/websocket-mode

I have been building on it over the past month, keeping WebSocket sessions warm on workers and routing commands with NATS JetStream. This has made using sidecar threads for a main thread very simple, as the worker treats them similarly.

nexustoken 22 hours ago [-]

Been building a task-dispatch API for a couple months, and the thing that bit me wasn't the async part — it was duplicate work. Two agents an hour apart paying twice for the exact same normalized input. Memory gap, not sync gap.

Once I hashed canonical input JSON, cache hit rate on real traffic was higher than expected — mid-teens % once a handful of workers were live. Curious if anyone here's tried cross-agent result sharing without bolting on a full pub/sub layer.
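For readers unfamiliar with the technique, hashing canonical input JSON can look roughly like this. It is a sketch, not the commenter's code: `canonical_key` and `ResultCache` are hypothetical names, the cache is an in-process dict rather than a shared store, and canonicalization here only means sorted keys and fixed separators (it does not normalize things like float formatting or Unicode forms).

```python
import hashlib
import json


def canonical_key(payload) -> str:
    # Sorted keys and fixed separators give the same bytes for logically equal inputs.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResultCache:
    def __init__(self) -> None:
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, payload, compute):
        key = canonical_key(payload)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # First time this normalized input is seen: do the paid work once.
        self.misses += 1
        result = compute(payload)
        self._store[key] = result
        return result
```

Two agents an hour apart would share results as long as they both go through the same store, which for cross-agent use would need to live outside any single process.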

konovalov-nk 6 hours ago [-]

Pivot to Erlang is real!

I'm kidding of course, but it feels like the time has come to look closely into the Erlang ecosystem and OTP.

There's even an agentic framework for this: https://jido.run/blog/jido-2-0-is-here

If you think about it, OTP makes a lot of sense for always-on, reachable agents. Agents need to talk to external systems all the time: web services, databases, message queues, local tools.

More than a year ago, I had the idea of building a personal AI assistant connected to multiple services (https://github.com/konovalov-nk/synaptra/blob/main/docs/arch...). But I didn't want to build yet another over-engineered k8s setup just to get isolation and separation of concerns.

Over time, I realized OTP was much closer to the model I actually wanted.

Why?

Some services want to run locally: memory, low-latency text-to-speech, private data access. The agent can also run locally while delegating work across supervised processes. Things will fail, and that's fine — Erlang was built around exactly that assumption.

Once you look at agents this way, they indeed look less like chat sessions and more like long-lived, supervised, stateful processes.

In that sense, Erlang really was ahead of its time.

Havoc 1 days ago [-]

Struggling with this at the moment too - the second you have a task that is a blend of CI style pipeline, LLM processing and openclaw handing that data back and forth, maintaining state and triggering next step gets tricky. They're essentially different paradigms of processing data and where they meet there are impedance mismatches.

Even if I can string it together it's pretty fragile.

That said I don't really want to solve this with a SaaS. Trying really hard to keep external reliance to a minimum (mostly the llm endpoint)

mettamage 1 days ago [-]

> The interesting thing is what agents can do while not being synchronously supervised by a human.

I vibe coded a message system where I still have all the chat windows open but my agents run a command that finishes once a message meant for them comes along, and then they need to start it back up again themselves. I kept it semi-automatic like that because I'm still experimenting with whether this is what I want.

But they get plenty done without me this way.

sebastiennight 1 days ago [-]

The idea of the "session" is an interesting solution, I'll be looking forward to new developments from you on this.

I don't think it solves the other half of the problem that we've been working on, which is what happens if you were not the one initiating the work, and therefore can't "connect back into a session" since the session was triggered by the agent in the first place.

zknill 1 days ago [-]

With the approach based on pub/sub channels, this is possible to do if you know the name of the session (i.e. know the name of the channel).

Of course the hard bit then is: how does the client know there's new information from the agent, or a new session?

Generally we'd recommend having a separate kind of 'notification' or 'control' pub/sub channel that clients always subscribe to to be notified of new 'sessions'. Then they can subscribe to the new session based purely on knowing the session name.
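The control-channel pattern can be illustrated with a tiny in-memory pub/sub. This is not Ably's API or any real client library; `PubSub`, `CONTROL`, `make_client`, and `agent_start_session` are all names invented for the sketch, and delivery is synchronous with no history.

```python
from collections import defaultdict

CONTROL = "control"  # the one channel every client always subscribes to


class PubSub:
    def __init__(self) -> None:
        self._subs = defaultdict(list)  # channel name -> list of handlers

    def subscribe(self, channel: str, handler) -> None:
        self._subs[channel].append(handler)

    def publish(self, channel: str, message: str) -> None:
        # Copy the list so handlers may subscribe to new channels while we iterate.
        for handler in list(self._subs[channel]):
            handler(message)


def make_client(bus: PubSub, received: list) -> None:
    def on_control(session_name: str) -> None:
        # Knowing the session name is enough to subscribe to it.
        bus.subscribe(session_name, lambda msg: received.append((session_name, msg)))

    bus.subscribe(CONTROL, on_control)


def agent_start_session(bus: PubSub, session_name: str, first_message: str) -> None:
    bus.publish(CONTROL, session_name)        # announce the new session
    bus.publish(session_name, first_message)  # then start talking on it
```

In this toy version the client is subscribed before the first session message is published only because delivery is synchronous; over a network, the session channel would need history or replay so a late subscriber does not miss the first messages.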

serbrech 1 days ago [-]

I recognize the problem statement and the decomposition of it. But not the solution. Especially saying that he sees the same problem being worked on by N people. And now that makes it N+1? I’ve been more interested in the protocols and standards that could truly solve this for everyone in a cross-compatible way. Some people have dabbled with atproto as the transport and “memory” storage for example.

tuo-lei 20 hours ago [-]

the async transport feels like the wrong layer to optimize. biggest issue i keep running into is agent session state being completely non-portable between tools. Claude Code dumps JSONL, Cursor splits data across SQLite and separate JSONL files, and none of them agree on schema or even what counts as a "turn". you can make the message bus async but if you can't reconstruct what the agent did from its own session data, that's the actual blocker. i'd rather see a shared session format than another pubsub layer.

anamexis 21 hours ago [-]

Maybe I’m missing something, but once you’ve got durable state, don’t you get durable transport more or less “for free” with SSE and Last-Event-ID?

sonink 23 hours ago [-]

I was of the same view - but then there is this other trend which is putting sync back in favor. And that is that agents are becoming faster. If they are faster - it makes sense to stick around and maintain your 'context' about the task and supervise in real time. The other thing which might keep sync in fashion is that LLM providers are cutting back on cheap tokens. So you have a bigger incentive to stick around and make sure that your agent is not going astray.

The only place I use async now is when I am stepping away and there are a bunch of longer tasks on my plate. So I kick them off and then get to review them whenever I log in next. However, I don't use this pattern all that much, and even then I am not sure if the context switching whenever I get back is really worth it.

Unless the agents get more reliable on long-horizon tasks, it seems that async will have limited utility. But I can easily see this going into videos feeding the Twitter AI launch hype train.

htahir111 1 days ago [-]

How would you differentiate between other tools like Temporal or Kitaru (https://kitaru.ai/) ?

zknill 1 days ago [-]

I don't know Kitaru too well, but I do know Temporal a bit.

The pattern I describe in the article of 'channels' works really well for one of the hardest bits of using a durable execution tool like Temporal. If your workflow step is long running, or async, it's often hard to 'signal' the result of the step out to some frontend client. But using channels or sessions like in the article it becomes super easy because you can write the result to the channel and it's sent in realtime to the subscribed client. No HTTP polling for results, or anything like that.

htahir111 23 hours ago [-]

so to be clear, this should be used "instead of" rather than "on top of" durable execution engines?

sudb 19 hours ago [-]

I think this post ignores, deliberately or not, the large group of async coding agents that have been GA since around early 2025 - probably the most well-known of which is Devin (which has been around since 2024, but not available to the public).

As an aside, I've built and deployed a production system in which disconnecting & reconnecting from an in-progress LLM stream works and resumes from wherever the stream currently is, through a combination of redis/valkey & websockets - it's not all that hard, it turns out!

TacticalCoder 1 days ago [-]

> ... and streaming the tokens back on the HTTP response as an SSE stream

> So how are folks solving this?

$5 per month dedicated server, SSH, tmux.

verdverm 19 hours ago [-]

If you build a coding agent on Google's ADK, it's designed for this background processing setup. It will transparently save the sessions and events, leaving it up to you what should be sent to the interface. Great framework, happy user with my personal agent stack

scotty79 22 hours ago [-]

It seems that people have spontaneously started using chat apps (Telegram and such) as a durable channel between them and their async agents.

Maybe somebody had better standardize that, because otherwise we'll end up with agents sending rich payloads between themselves via Telegram.

dist-epoch 24 hours ago [-]

Can anybody explain why many times if you switch away from the chat app on the phone, the conversation can get broken?

Having long-lived requests, where you submit one, get back a request_id, and then poll for its status, is a 20-year-old solved problem.

Why is this such a difficult thing to do in practice for chat apps? Do we need ASI to solve this problem?

zknill 23 hours ago [-]

I suspect the answer is that the AI chat-app is built so that the LLM response tokens are sent straight into the HTTP response as a SSE stream, without being stored (in their intermediate state) in a database. BUT the 'full' response _is_ stored in the database once the LLM stream is complete, just not the intermediate tokens.

If you look at the gifs of the Claude UI in this post[1], you can see how the HTTP response is broken on page refresh, but some time later the full response is available again because it's now being served 'in full' from the database.

[1]: https://zknill.io/posts/chatbots-worst-enemy-is-page-refresh...

petesergeant 1 days ago [-]

at https://agentblocks.ai we just use Google-style LROs for this, do we really need a "durable transport for AI agents built around the idea of a session"?

zknill 1 days ago [-]

Assuming LROs are "Long running operations", then you kick off some work with an API request, and get some ID back. Then you poll some endpoint for that ID until the operation is "done". This can work, but when you try and build in token-streaming to this model, you end up having to thread every token through a database (which can work), and increasing the latency experienced by the user as you poll for more tokens/completion status.

Obviously polling works, it's used in lots of systems. But I guess I am arguing that we can do better than polling, both in terms of user experience, and the complexity of what you have to build to make it work.

If your long running operations just have a single simple output, then polling for them might be a great solution. But streaming LLM responses (by nature of being made up of lots of individual tokens) makes the polling design a bit more gross than it really needs to be. Which is where the idea of 'sessions' comes in.

sudb 19 hours ago [-]

Did you consider websockets? Curious to know if I'm missing something!

sharathr 15 hours ago [-]

[dead]

jimmypk 20 hours ago [-]

[dead]

potter098 24 hours ago [-]

[dead]

maxbeech 1 days ago [-]

[dead]

EthanFrostHI 19 hours ago [-]

[dead]

pando85 12 hours ago [-]

[dead]
