Launch HN: Cekura (YC F24) – Testing and monitoring for voice and chat AI agents
shubhamintech 1 day ago
The full-session evaluation framing is the right call - most teams don't realize the failure happened in turn 2 until they've spent 3 hours blaming the model. One thing worth thinking about as you grow: connecting caught regressions to production conversation data. When your simulation flags a new failure mode, being able to say "this pattern has already surfaced X times in prod this week" cuts the prioritization debate in half. Does Cekura currently let you correlate simulation failures back to real user sessions, or is that still a manual step?
atarus 1 day ago
We track the failure modes in production directly instead of relying on simulation. So if we suddenly see a failure mode pop up too often, we can alert in a timely manner. With the approach of going from simulation to monitoring, I worry the feedback might be delayed.

Doing it in production also lets us run simulations by replaying those production conversations, ensuring you are handling regressions.

bhekanik 9 hours ago
This is a solid framing. In my experience the nasty regressions are rarely a bad single response; they are state drift over 6-12 turns (verification skipped, tool called in the wrong order, recovery path never triggered).

One thing that's helped us is tagging each test with an explicit risk class (safety/compliance, business logic, UX) and tracking those buckets over time instead of relying on one pass/fail number. Release decisions get much less hand-wavy when one category starts creeping.

Session-level eval plus risk-bucket trends feels like the right combo for teams shipping agents weekly.
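A minimal sketch of that risk-bucket aggregation (test names and risk classes are hypothetical, not any particular platform's schema): tag each test result with an explicit risk class and report pass rates per bucket instead of one global score.

```python
from collections import defaultdict

# Hypothetical test results: (test_name, risk_class, passed)
results = [
    ("refund_requires_verification", "safety/compliance", False),
    ("order_status_lookup", "business_logic", True),
    ("greeting_tone", "ux", True),
    ("escalation_path", "safety/compliance", True),
]

# Aggregate pass rate per risk bucket rather than one pass/fail number
buckets = defaultdict(lambda: [0, 0])  # bucket -> [passed, total]
for _, risk_class, passed in results:
    buckets[risk_class][1] += 1
    buckets[risk_class][0] += int(passed)

for bucket, (passed, total) in sorted(buckets.items()):
    print(f"{bucket}: {passed}/{total} passed")
```

Tracking these per-bucket numbers over releases is what makes a creeping category visible before the aggregate score moves.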

CloakHQ 8 hours ago
the risk-bucket framing is a useful upgrade from binary pass/fail. we've run into something similar where the failure category changes what you actually do about it - a compliance drift is a rollback, a business logic drift might just need a prompt tweak, a UX drift you might ship anyway and monitor. collapsing those into one score loses the signal that matters for the decision.

the 6-12 turn state drift pattern is real. in browser sessions the equivalent is the fingerprint signal accumulating - each action is fine, but by turn 8 you've built a behavioral profile that reads as bot. same compounding problem, different domain. makes me think the right eval unit for agents in general isn't the turn or even the session, it might be the trajectory shape.

MickeyShmueli 7 hours ago
the mock tool platform thing is smart. testing agents against real APIs is a nightmare: you get flakiness, you burn through rate limits, and you can't reproduce failures

one thing i'm curious about: how do you handle testing the tool selection logic itself? like the agent choosing WHICH tool to call is often where things break, not the tool execution

we had a support agent that would sometimes call the "refund order" tool when the user just wanted to check order status. the tool worked perfectly, the LLM just kept picking the wrong one. your mock platform lets you verify the tool returns the right data, but does it catch when the agent calls the wrong tool entirely?
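One way to catch that class of failure is to assert on the recorded tool-call transcript itself, not on what the tools returned. A hedged sketch (the helper and transcript shape are hypothetical, not Cekura's API):

```python
# Assert on WHICH tools the agent selected, independent of tool output.
def assert_tool_selection(tool_calls, expected_tool, forbidden_tools=()):
    """tool_calls: list of {"name": ..., "args": ...} recorded by a mock tool layer."""
    called = [c["name"] for c in tool_calls]
    assert expected_tool in called, f"expected {expected_tool}, got {called}"
    for bad in forbidden_tools:
        assert bad not in called, f"agent called forbidden tool {bad}"

# User only asked to check order status; refund_order must never fire.
transcript = [{"name": "get_order_status", "args": {"order_id": "A123"}}]
assert_tool_selection(transcript, "get_order_status",
                      forbidden_tools=("refund_order",))
print("tool selection ok")
```

The `forbidden_tools` list is the useful half: it turns "the agent picked the wrong tool" from a silent wrong path into an explicit test failure.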

also the full-session evaluation vs turn-by-turn is spot on. had a similar issue with a verification flow where each individual turn looked fine in langsmith but the overall flow was completely broken. you'd see "assistant asked for name" (good), "assistant asked for phone" (good), "assistant processed request" (good), but it never actually verified the phone number matched the account

tbh this feels like one of those problems that's obvious in hindsight but nobody builds the tooling for until they get burned in production

michaellee8 6 hours ago
In that case I think you can have a refund subagent that is responsible for checking whether the user really asked for a refund before doing these dangerous things. But that only minimizes errors; LLMs are non-deterministic by nature.
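A sketch of that guard idea, with a keyword stub standing in for the verifier subagent (all names are hypothetical; a real system would make a second LLM call here, which is exactly why it only minimizes rather than eliminates errors):

```python
# Gate a dangerous tool behind a separate intent check before executing it.
def refund_guard(user_message, verifier):
    return verifier(user_message)

def stub_verifier(msg):
    # Illustrative stub; a real verifier would be a second model call.
    return "refund" in msg.lower()

def execute_tool(name, user_message, verifier=stub_verifier):
    if name == "refund_order" and not refund_guard(user_message, verifier):
        return "blocked: verifier did not confirm refund intent"
    return f"executed {name}"

print(execute_tool("refund_order", "where is my order?"))  # blocked
print(execute_tool("refund_order", "I want a refund"))     # executed
```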
CloakHQ 9 hours ago
the full-session evaluation framing resonates a lot. we've been running browser automation agents and the exact same problem shows up: individual actions pass, session fails. a click works, a form fill works, but the session gets flagged or blocked 8 turns in because something earlier created a signal that compounded.

one failure mode that's specific to browser agents and doesn't get much attention: the test environment is too clean. when you run simulations against a controlled setup, the agent never encounters the friction that real sessions do - bot detection challenges, CAPTCHAs, dynamic content that loads differently, fingerprinting checks mid-session. so you end up with agents that pass your test suite but fail in the wild, and the gap is in the environmental assumptions not the agent logic.

the mock tool platform approach is interesting precisely because it sidesteps this - you're testing the agent's decision-making in isolation from the messy runtime. that's valid for catching logic regressions. but i'd be curious how you handle cases where the tool call itself triggers secondary effects in the environment (e.g. the API call changes session state in ways that affect what the agent sees next).

also, does your session-level judge handle cases where the correct behavior is adaptive - where the agent should change strategy mid-session based on what it encountered? that feels like a harder eval problem than a fixed expected outcome.

FailMore 1 day ago
Any ideas on how to solve the problem that agents don't have total common sense?

I have found when using agents to verify agents, that the agent might observe something that a human would immediately find off-putting and obviously wrong but does not raise any flags for the smart-but-dumb agent.

atarus 1 day ago
To clarify you are using the "fast brain, slow brain" pattern? Maybe an example would help.

Broadly speaking, we see people experiment with this architecture a lot, often with a great deal of success. Another approach would be an agent orchestrator architecture with an intent recognition agent that routes to different sub-agents.

Obviously there are endless cases possible in production, and the best approach is to build your evals using that data.

rush86999 1 day ago
The only solution is to train on the issue for next time.

Architecturally, we focus on episodic memory with a feedback system.

That training is retrieved the next time something similar happens.
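A minimal illustration of that retrieve-on-similarity loop (this is not the linked project's API; crude token overlap stands in for real similarity search, and all names are made up):

```python
# Episodic memory sketch: store past failures with a correction, retrieve
# the most similar episode later, and surface its lesson to the agent.
episodes = []

def remember(situation, lesson):
    episodes.append((set(situation.lower().split()), lesson))

def recall(situation):
    tokens = set(situation.lower().split())
    best = max(episodes, key=lambda e: len(e[0] & tokens), default=None)
    if best and (best[0] & tokens):
        return best[1]
    return None  # nothing similar enough

remember("user asked order status agent issued refund",
         "Only call refund_order after explicit refund confirmation.")
print(recall("user asks about order status"))
```

In a production version the token overlap would be an embedding lookup and the lesson would be injected into the prompt, but the shape of the loop is the same.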

atarus 1 day ago
Training is overkill at this point imo. I have seen agents work quite well with a feedback loop, some tools, and prompt optimisation. Are you doing fine-tuning on the models when you say training?
rush86999 1 day ago
Nope - we just use a memory layer with a model routing system.

https://github.com/rush86999/atom/blob/main/docs/EPISODIC_ME...

atarus 1 day ago
Memory is usually slow, and I haven't seen many voice agents leverage it, at least. Are you building in text modality, or audio as well?
niko-thomas 1 day ago
We've tried a few platforms for voice agent testing and Cekura has been the best by a long shot. Keep up the great work!
sidhantkabra 1 day ago
Was really fun building this - would love feedback from the HN community and insights into your current process.
chrismychen 1 day ago
How do you handle sessions where the correct outcome is an incomplete flow — e.g. the agent correctly refuses to move forwards because the caller failed verification, or correctly escalates to a human?
atarus 1 day ago
This comes from our architecture. Since we are aware of the agent's context, our test agents know about the incomplete flows, and the assertions are per session.

If we miss some cases, there's always a feedback loop to help you improve your test suite.

guerython 1 day ago
we treat each scenario as an explicit state machine. every conversation has checkpoints (ask for name, verify dob, gather phone) and the case only passes if each checkpoint flips true before the flow moves on. that means if the agent hallucinates, skips the verification step, or escalates to a human too early you get a session-level failure, not just a happily-green last turn. logging which checkpoint stayed false makes regressions obvious when you swap prompts/models.
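That checkpoint state machine can be sketched in a few lines (checkpoint names are hypothetical; the point is that a session only passes if every checkpoint flips true, in order):

```python
# Session-level evaluation as an ordered checkpoint state machine.
CHECKPOINTS = ["ask_name", "verify_dob", "gather_phone"]

def evaluate_session(events):
    """events: ordered checkpoint names observed in the transcript."""
    hit = {cp: False for cp in CHECKPOINTS}
    expected = iter(CHECKPOINTS)
    pending = next(expected, None)
    for ev in events:
        if ev == pending:          # checkpoints must fire in order
            hit[ev] = True
            pending = next(expected, None)
    failed = [cp for cp, ok in hit.items() if not ok]
    return (not failed, failed)

# Agent skipped DOB verification: the session fails even though the
# later turns each look fine in isolation.
ok, failed = evaluate_session(["ask_name", "gather_phone"])
print(ok, failed)  # False ['verify_dob', 'gather_phone']
```

Returning the list of checkpoints that stayed false is what makes regressions legible when you swap prompts or models, rather than a bare red/green.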
moinism 1 day ago
congrats on the launch! do you guys have anything planned to test chat agents directly in the ui? I have an agent but no exposed api, so I can't really use your product even though I have a genuine need.
atarus 1 day ago
Yes, we do support integrations with different chat agent providers, and also SMS/WhatsApp agents where you can just drop in the agent's number.

Let us know how your agent can be connected and we can best advise on how to test it.

jamram82 1 day ago
Testing voice agents would require some kind of knowledge integration. Do you have any plans to support custom knowledge bases for testing voice agents?
atarus 1 day ago
Yes, we already support knowledge base integration for BigQuery and plan to expand the set of connectors. You can always drop in knowledge files currently.

Moreover, we even generate scenarios from the knowledge base.

michaellee8 1 day ago
Interesting, I built https://github.com/michaellee8/voice-agent-devkit-mcp exactly for this: launch a chromium instance with virtual devices powered by Pulsewire, then hook it up with TTS and STT so that playwright can finally have a mouth and ears. Any chance we can talk?
atarus 1 day ago
That's actually interesting. Is there a dependency on the user to create the HTTP endpoints for /speak and /transcript?

One of our learnings has been to allow plugging into existing frameworks easily. Examples: livekit, pipecat, etc.

Happy to talk if you can reach out to me on linkedin - https://www.linkedin.com/in/tarush-agarwal/

michaellee8 17 hours ago
Just sent a connection invitation on LinkedIn. This was actually designed to allow e2e automation using playwright-mcp for a previous startup I worked at that does voice-based job interview agents. The HTTP endpoints are provided by a daemon sitting in the background, listening to all input to the virtual mic, then transcribing and storing it. The agent can hit /speak and /transcript through an MCP. We had built Livekit Agents-specific solutions by injecting text responses, but felt that was not enough since we want to test the whole thing end to end, so I hacked up a way to do a virtual mic/speaker. It was designed to close the dev-test-debug loop so that Claude Code can develop on its own rather than relying on a human to test it.