I really want automated QA to work better! It's a great thing to work on.
Some feedback:
- I definitely don't want three long new messages on every PR. Max 1, ideally none? Codex does a great job just using emoji.
- The replay is cool. I don't make a website, so maybe I'm not the target market, but I'd like QA for our backend.
- Honestly, I'd rather just run a massive QA run every day, and then have any failures bisected, rather than per-PR.
- I am worried that there's not a lot of value beyond the intelligence of the foundation models here.
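The "daily run, then bisect failures" idea above can be sketched as a binary search over the day's commits. This is only an illustration: `first_bad_commit` and the `is_bad` callback are hypothetical names, and the sketch assumes commits are ordered oldest to newest, the last one fails, and a failure persists once introduced.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad(commit) is true.

    Assumptions (hypothetical setup): commits are ordered oldest to newest,
    the last commit is known to fail, and every commit after the first
    failing one also fails.
    """
    if not commits:
        raise ValueError("need at least one commit to bisect")
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the culprit is mid or something earlier
        else:
            lo = mid + 1    # the culprit is strictly after mid
    return commits[lo]


# Made-up commit ids; in practice is_bad would run the QA suite at that commit.
commits = ["a1", "b2", "c3", "d4", "e5"]
failing = {"d4", "e5"}
culprit = first_bad_commit(commits, lambda c: c in failing)
```

With n commits merged since the last green run, this needs about log2(n) QA runs rather than one per PR.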
thienannguyencv 5 hours ago [-]
This benchmark measures whether tests are relevant, coherent, and have good coverage. But there's a more subtle type of error: the AI creates tests that look specific to the PR but are actually generic patterns mapped from the training data, with correct test structure and reasonable assertions, but not actually interacting with what this specific piece of code does.
How do you differentiate between "understood the code and generated a targeted test" and "recognized this looks like an auth flow and produced a standard auth test template"? The latter might still pass your coherence/relevance metrics while missing the actual exception.
Bnjoroge 1 days ago [-]
Agree on your last point and it's going to be a very bitter lesson. In any case, you probably wanna shift a lot of the code verification as far left as possible, so doing review at PR time isn't the right strat imo. And Claude/Codex are well positioned to do the local review.
ashgam 1 hours ago [-]
Agree on the shift-left concept, but curious about your thoughts on a checker-maker loop. Running a PR review bot is different from running /review on local dev, right? Also, there have already been instances of Claude patching the test scripts instead of fixing the bugs to make the tests pass.
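One cheap guard for the checker side of such a loop is to flag any change set that also touches test files, so a "fix" that edits the tests gets a human look. A minimal sketch, assuming the changed paths are already available as strings; `touched_tests` and the filename patterns are hypothetical and not taken from any tool mentioned here.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List

# Hypothetical naming conventions; adjust to the repository in question.
TEST_PATTERNS = ("test_*.py", "*_test.py", "*.spec.ts", "*.test.ts")


def touched_tests(changed_paths: Iterable[str]) -> List[str]:
    """Return the changed paths that look like test files."""
    hits = []
    for path in changed_paths:
        p = PurePosixPath(path)
        in_tests_dir = "tests" in p.parts
        name_matches = any(fnmatchcase(p.name, pat) for pat in TEST_PATTERNS)
        if in_tests_dir or name_matches:
            hits.append(path)
    return hits


suspicious = touched_tests(["src/auth.py", "tests/test_auth.py", "web/login.spec.ts"])
```

A non-empty result does not prove the tests were weakened; it only marks the change for a closer review.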
arkheosrp26 19 hours ago [-]
[flagged]
monkpit 15 hours ago [-]
Isn’t the last point the case with every AI startup? Nobody has a moat and it’s tough to build one because the playing field is so level.
_heimdall 8 hours ago [-]
I've been confused by this with many LLM products in general. Sometimes infrastructure is part of it so there's that, but often it seems like the product is a magic incantation of markdown files.
ashgam 1 hours ago [-]
Solving for infrastructure is a huge part of the problem too. Curious to know what you think about it?
_heimdall 47 minutes ago [-]
Here I'm mostly considering the seemingly countless services that are little more than some markdown files and their own API passing data to/from the LLM provider's API.
By no means is that every AI product today, and I wasn't saying the OP QA service falls into that bucket though.
More of a general comment related to the GP, maybe too off topic here though?
Visweshyc 1 days ago [-]
Thanks for the feedback!
- Agreed that the form factor can be condensed with a link to detailed information
- With the codebase understanding, backend is where we are looking to expand and provide value
- The intelligence of the models does lay out the foundation but combining the strength of these models unlocks a system of specialized agents that each reason about the codebase differently to catch the unknown unknowns
pastescreenshot 15 hours ago [-]
The interesting question to me is not whether the system can generate a plausible PR-time test, but whether the useful ones survive after the PR is gone. If Canary catches a real regression, how often can that check be promoted into a stable long-lived regression test without turning into a flaky, environment-coupled browser script? That conversion rate feels closer to the real moat than the generation demo.
Visweshyc 13 hours ago [-]
Good point. To keep the regression tests reliable as the app evolves, we run a reliability cascade. First, we generate and execute deterministic Playwright from the codebase. If execution fails, we fall back to the DOM and ARIA tree. If that still fails, we fall back to vision agents that verify what the user actually sees before flagging a drift in the application behavior.
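The cascade described above can be read as an ordered fallback chain: try the cheapest, most deterministic check first and only flag drift when every strategy fails. The sketch below is a guess at that shape, not the actual implementation; `run_cascade`, `CheckResult`, and the three lambda checks are hypothetical stand-ins for the Playwright, DOM/ARIA, and vision steps.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class CheckResult:
    passed: bool
    strategy: Optional[str]      # name of the strategy that passed, if any
    attempts: Tuple[str, ...]    # strategies tried, in order


def run_cascade(strategies: Sequence[Tuple[str, Callable[[], bool]]]) -> CheckResult:
    """Try each (name, check) pair in order and stop at the first that passes."""
    attempts = []
    for name, check in strategies:
        attempts.append(name)
        try:
            if check():
                return CheckResult(True, name, tuple(attempts))
        except Exception:
            # A crashed strategy is treated like a failed one: fall through.
            continue
    # Every strategy failed, so this is where drift would be flagged.
    return CheckResult(False, None, tuple(attempts))


# Stand-in checks: the scripted step fails, the DOM/ARIA step passes.
result = run_cascade([
    ("playwright_script", lambda: False),
    ("dom_aria_tree", lambda: True),
    ("vision_agent", lambda: True),
])
```

Recording which strategy passed is useful in its own right: a test that keeps passing only at the vision step is a candidate for regeneration.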
recsv-heredoc 1 days ago [-]
The market timing on this is perfect - it fills a major current gap I've seen emerging.
I've heard a few stories of QA departments being near-burnout due to the increased rate developers are shipping at these days. Even we're looking for any available QA resources we can pull in here.
No harm meant with the question - but what's the advantage over Claude Code + the GitHub integrations?
Visweshyc 23 hours ago [-]
We evaluated test generation using Claude Code and our purpose-built harness and measured the quality of tests in catching the unknown unknowns. We noticed Claude Code misses the second-order effects that actually break applications. You also need infrastructure to execute the tests: browser fleets, ephemeral environments, and data seeding all need to be handled.
warmcat 1 days ago [-]
Good work. But what makes this different from just another feature in Gemini Code Assist or GitHub Copilot?
Visweshyc 1 days ago [-]
Thanks! To execute these tests reliably you would need custom browser fleets, ephemeral environments, data seeding and device farms
mikestorrent 15 hours ago [-]
If that's what you guys are bringing, you should put that more up front; focus on making it clear you're providing ingredients that Claude et al will not be providing on their own without Real Actual Software to do it.
Visweshyc 13 hours ago [-]
Fair feedback. Will make that clearer. Appreciate it
solfox 1 days ago [-]
Not a direct competitor but another YC company I use and enjoy for PR reviews is cubic.dev. I like your focus on automated tests.
Visweshyc 1 days ago [-]
Thanks! We believe executing the scenarios and showing what actually broke closes the loop
Bnjoroge 1 days ago [-]
what kinds of tests does it generate and how's this different from the tens of code review startups out there?
Visweshyc 1 days ago [-]
The system focuses on going beyond the happy path and generating edge case tests that try to break the application. For example, a Grafana PR added visual drag feedback to query cards. The system came up with an edge case like - does drag feedback still work when there's only one card in the list, with nothing to reorder against?
solfox 1 days ago [-]
Looks interesting! Looks like perhaps no support for Flutter apps yet?
Visweshyc 1 days ago [-]
Yes, we currently support web apps but plan to extend the foundation to test mobile applications on device emulators.
opensre 24 hours ago [-]
[flagged]
tgtracing 14 hours ago [-]
[dead]
vivzkestrel 16 hours ago [-]
- there are at least 10 dozen code review startups at this point and I see a new one on YC every week
- what is your differentiator?
Visweshyc 13 hours ago [-]
We see this as different from review. The system generates tests to catch second-order effects and executes them against the live application to expose bugs