The comment about having "3 to 6 hours per day" to work directly with code is the key insight here. I run a small AI consultancy and use Claude Code daily to deliver client projects — chatbots, automation pipelines, API integrations — and the spec-driven approach described in this post is what makes it actually work at scale.
The pattern I've converged on: spend the first 30 minutes writing detailed markdown specs (inputs, outputs, edge cases, integration points), then let Claude Code chew through the implementation while I review, test, and iterate. For a typical automation project — say a WhatsApp bot that handles booking flows and integrates with a client's CRM — this cuts delivery time roughly in half compared to writing everything manually.
The biggest practical lesson: the spec quality is everything. A vague spec produces code you'll spend more time debugging than you saved. A good spec with explicit error handling expectations, API response formats, and state transitions produces code that's 80-90% production-ready on the first pass.
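To make "good spec" concrete, here is a minimal skeleton of the kind of markdown spec described above, using the WhatsApp booking bot as the example (the section names and details are illustrative, not a fixed format):

```markdown
# Spec: booking-flow handler

## Inputs
- Inbound WhatsApp message (text, possibly free-form dates)

## Outputs
- CRM booking record; confirmation message back to the user

## API response formats
- CRM `POST /bookings` returns `{id, status}`; treat any non-2xx as retryable once

## State transitions
- NEW -> SLOT_PROPOSED -> CONFIRMED | CANCELLED (no other transitions allowed)

## Error handling
- CRM unreachable: queue the booking, apologize to the user, never drop the message
- Ambiguous date: ask a clarifying question instead of guessing

## Edge cases
- Duplicate confirmations, messages outside business hours, unsupported locales
```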
Where I disagree slightly with the parallel agent approach: for client-facing work where correctness matters more than speed, I've found 2-3 focused agents (one on backend, one on frontend, one on tests) more reliable than 6-8 competing agents that create merge conflicts. The overhead of resolving conflicts and ensuring consistency across parallel outputs eats into the productivity gains fast.
theshrike79 12 hours ago [-]
I've recently started adding a PROJECT.md to all my own projects to keep the direction consistent.
Just something that tells the LLM (and me, as I tend to forget) what the actual purpose of the project is and what features are next.
In many cases the direction tends to get lost and the AI starts adding features like it's building a multi-user SaaS, or helpfully adding things that aren't in scope because I have another project doing that already.
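A PROJECT.md along these lines can be tiny; a hypothetical example of the shape:

```markdown
# PROJECT.md

## Purpose
Single-user CLI for tracking reading notes. Local files only.

## Non-goals (out of scope)
- Multi-user / SaaS features of any kind
- Sync or a web UI (handled by a separate project)

## Next features, in order
1. Tag-based search
2. Export to markdown
3. Nothing else until the above ship
```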
agenthustler 2 hours ago [-]
The STATE.md approach described here is exactly how I've been running an autonomous agent for the past 23 days.
The agent wakes up every 2 hours with zero memory (no persistent context), reads its own STATE.md to understand what past iterations did, evaluates what's working, and acts. The STATE.md is simultaneously the spec AND the execution log.
What I've found: the biggest failure mode isn't the agent making bad decisions — it's the agent rediscovering its own architecture every run. Each time it reads STATE.md, it's essentially a new agent with inherited knowledge. This creates an interesting pressure to write STATE.md entries that are actually useful to a future self, not just status updates.
The spec-driven approach here solves the same problem from the other direction: the spec gives the agent enough context to act without needing to reconstruct context from scratch.
The tmux session management also resonates — we use a LaunchAgent plist for the same reason: consistent environment with keychain access, no cron auth issues.
I did a sort of bell curve with this type of workflow over summer.
- Base Claude Code (released)
- Extensive, self-orchestrated, local specs & documentation; ie waterfall for many features/longer term project goals (summer)
- Base Claude Code (today)
Claude Code is getting better at orchestrating its own subagents for divide/conquer type work.
My problem with these extensive self-orchestrated multi-agent/spec modes is the drift and rot across all the changes and integrated parts of an application, which a lot of the time end up in merge conflicts. Aside from taxing my own cognitive space for decisions, it's also a lot to orchestrate and review in general. I spent a ton of time enforcing that Claude use the system I put in place, including documentation updates and continuous logging of work.
I feel extremely productive with a single Claude Code for a project. Maybe for minor features, I'll launch Claude Code in the web so that it can operate in an isolated space to knock them out and create a PR.
I will plan and annotate extensively for large features, but not many features or broad project specs all at the same time. Annotation and better planning UX, I think, are going to be increasingly important for now. The only augment of Claude Code I have is a hook for plan mode review: https://github.com/backnotprop/plannotator
schipperai 1 day ago [-]
The merge conflicts and cognitive load are indeed two big struggles with my setup. Going back to a single Claude instance however would mean I’m waiting for things to happen most of the time. What do you do while Claude is busy?
medi8r 1 day ago [-]
It is one of those things I look at and think: yeah, you are hyper productive... but it looks cognitively like being a pilot landing a plane all day long, and not what I signed up for. Where is my walk in the local park where I think through stuff and come up with a great idea :(
esperent 1 day ago [-]
I think that's slightly the wrong way to look at this multi agent stuff.
I have between 3 and 6 hours per day where I can sit in front of a laptop and work directly with the code. The more of the actual technical planning/coding/testing/bug fixing loop I can get done in that time the better. If I can send out multiple agents to implement a plan I wrote yesterday, while another agent fixes lint and type errors, a third agent or two or three are working with me on either brainstorming or new plans, that's great! I'll go out for a walk in the park and think deeply during the rest of the day.
When people hear about all of these agents - working on three plans at once, really? - it sounds overwhelming. But realistically there's a lot of downtime on both sides. I ask the agent a question, it spends 5-10 minutes exploring. During that time I check on another agent, read some code that has been generated, or do some research of my own. Then I'll switch back to that terminal when I'm ready and ask a follow-up question, mark the plan as ready, or whatever.
The worst thing I did when I was first getting excited about how agents were good now, a whole two months ago, was set things up so I could run a terminal on my phone and start sessions there. That really did destroy my deep thinking time, and lasted for about 3 days before I deleted termux.
schipperai 1 day ago [-]
it can be cognitively demanding but you adapt and often get in a flow state… it's nothing like programming used to be, though, and I get that
ramoz 1 day ago [-]
Quite a bit.
- Research
- Scan the web
- Text friends
- Side projects
- Take walks outside
etc
synergy20 19 hours ago [-]
I use claude-code. claude-code now spins up many agents on its own, sometimes switches models to save costs, can easily use 200+ tools concurrently, and uses multiple skills at the same time when needed. Its automation gets smarter and more parallel by the day; do we still need to outwit what's probably already done by claude-code? I still use tmux, but no longer for multiple agents, just for me to poke around at will. I let the plan/code/review/whatever be fully managed and parallelized by claude-code itself. It's massively impressive.
nkko 18 hours ago [-]
This rings true, as I’ve noticed that with every new model update, I’m leaving behind full workflows I’ve built. The article is really great, and I do admire the system, even if it is overengineered in places, but it already reads like last quarter’s workflow. Now letting Codex 5.3 xhigh chug for 30 minutes on my super long dictated prompt seems to do the trick. And I’m hearing 5.4 is a meaningfully better model. Also, for fully autonomous scaffolding of new projects towards the first prototype, I have my own version of a very simple Ralph loop that gets fed a gpt-pro super spec file.
suls 17 hours ago [-]
The bigger question for me is how to use this efficiently as a team of engineers. Most workflow tools I've seen so far focus on making a single engineer get more out of a claude/codex subscription, but not much on how teams as a whole can become more productive.
Any ideas?
jillesvangurp 16 hours ago [-]
My hunch is to experiment not as a team but individually. With teams you want a bit more stability in terms of workflows. A lot of this stuff involves people handcrafting workflows on top of tools and models that are dramatically changing nearly constantly. That kind of chaos is not something you want at the team level.
I'm mostly sticking to a codex workflow. I transitioned from the cli to their app when they released it a few weeks ago and I'm pretty happy with that. I've had to order extra tokens a few times but most weeks I get by on the $20 ChatGPT Plus subscription. That's not really compatible with burning hundreds/thousands on using lots of parallel agents in any case.
I also have a hunch that there are some fast diminishing returns on that kind of spending. At least, I seem to get a lot of value out of just spending 20/month. A lot of that more extreme burn might just be tool churn / inefficiency.
With teams, basically you should organize around CI/CD, pull requests and having code reviews (with or without AI assists). Standard stuff; you should be doing that anyway. But doubling down on making this process fast and efficient pays off. With LLMs the addition to this would be codifying/documenting key skills in your repositories for doing stuff with your code base and ways of working. A key thing in teams is to own and iterate on that stuff and not let it just rot. PRs against that should be well reviewed and coordinated and not just sneaked in.
Otherwise, AI usage just increases the volume of PRs and changes. Most of these tools in any case work a lot better if you have a good harness around your workflow that allows it to run linting/tests, etc. If you have good CI, this shouldn't be hard to express in skill form. The issue then becomes making sure the team gets good at producing high quality PRs and processing them efficiently. If you are dealing with a lot of conflicts, PR scope creep, etc. that's probably not optimal.
A lot of stuff related to coordinating via issue trackers can also be done with agents. If you have gh cli set up, it can actually create, label, etc. or act on github issues. That opens the door to also using LLMs for broader product management. It's something I've been meaning to experiment with more. But for bigger teams that could be something to lean on more. LLMs filing lots of issues is only helpful if you have the means to stay on top of that. That requires workflows where a lot of issues are short lived (time to some kind of resolution). This is not something many teams are good at currently.
wiseowise 14 hours ago [-]
The bigger question for me is how much compensation is increased as we enter this insanity? (It’s a rhetorical question, the answer is 0)
samusiam 13 hours ago [-]
IMO we should all be asking for a raise if our company is making more money. Proportionally, even.
v_CodeSentinal 21 hours ago [-]
The deny list section hit home. I keep seeing agents use unlink instead of rm, or spawn a python subprocess to delete files. Every new rule just taught the agent a new workaround.
Ended up flipping the model — instead of blocking bad actions, require proof of safety before any action runs. No proof, no action. Much harder to route around.
Curious if you've tried anything similar.
hrimfaxi 21 hours ago [-]
What does proof of safety look like in practice? Could you give some examples?
v_CodeSentinal 20 hours ago [-]
Nothing super fancy.
For me “proof” just means the agent has to make its intent explicit in a way I can check before running it.
For example:
1) If it wants to delete a file, it has to output the exact path it thinks it’s deleting. I normalize it and make sure it’s inside the project root. If not, I block it.
2) If it proposes a big change, I require a diff first instead of letting it execute directly.
3) After code changes, I run tests or at least a lint/type check before accepting it.
So it’s less about formal proofs and more about forcing the agent to surface assumptions in a structured way, then verifying those assumptions mechanically.
Still hacky, but it reduced the “creative workaround” behavior a lot.
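A minimal sketch of check (1), the path normalization, in Python. The function name and layout are mine, not the commenter's actual tooling; the idea is just that the agent states the exact path and we verify it mechanically against the project root:

```python
from pathlib import Path

def is_safe_delete(requested: str, project_root: str) -> bool:
    """Allow a delete only if the resolved path stays inside the project root."""
    root = Path(project_root).resolve()
    # Resolve relative to the root so ".." and symlinks can't smuggle the
    # target outside the tree.
    target = Path(project_root, requested).resolve()
    return target != root and root in target.parents

print(is_safe_delete("build/tmp.txt", "/home/me/proj"))    # True: inside root
print(is_safe_delete("../../etc/passwd", "/home/me/proj")) # False: escapes root
```

Deleting the root itself is also rejected, which is usually what you want from an agent guard.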
schipperai 14 hours ago [-]
Is this a policy snippet you add to your CLAUDE.md? Do you still maintain a deny list?
I recently added a snippet asking Claude not to try to bypass the deny list. I haven't had an incident since, but I'm still nervous... Claude once bypassed the deny list and nuked an important untracked directory, which caused me lots of trouble.
For major, in depth refactors and large scale architectural work, it's really important to keep the agents on-track, to prevent them from assuming or misunderstanding important things, or whatever — I can't imagine what it'd be like doing parallel agents. I don't see how that's useful. And I'm a massive fan of agentic coding!
It's like OpenClaw for me — I love the idea of agentic computer use; but I just don't see how something so unsupervised and unsupervisable is remotely a useful or good idea.
tinodb 2 hours ago [-]
I have found that with a good plan we are able to make big refactors quite a bit faster. The approach is that our /create-plan command starts high level, and only when we agree on that, fills in the details. It will also determine in what pull requests it plans to deliver the work. The size estimation of the PRs is never correct, but it gives a good enough phase split for the next step: letting it rip with a “Ralph loop” (just a bash script with a `while` loop running `claude -p --yolo`), with instructions to use jj (or git) and some other must-read skills.
This lets us review the end result, and correct with a review. That then gets incorporated whilst having claude rework the actual small prs that we can easily review and touch up.
I must say jj helps massively in staying sane and rebasing a lot. Claude fixes the conflicts fine.
We have been able to push ~5K of changes in a couple days, whilst reviewing all code, and making sure it’s on par with our quality requirements. And not writing a line of code ourselves.
I would have never attempted these large scale refactors, and we would have been stuck with the tech debt forever in the past.
CloakHQ 1 day ago [-]
We ran something similar for a browser automation project - multiple agents working on different modules in parallel with shared markdown specs. The bottleneck wasn't the agents, it was keeping their context from drifting. Each tmux pane has its own session state, so you end up with agents that "know" different versions of reality by the second hour.
The spec file helps, but we found we also needed a short shared "ground truth" file the agents could read before taking any action - basically a live snapshot of what's actually done vs what the spec says. Without it, two agents would sometimes solve the same problem in incompatible ways.
Has anyone found a clean way to sync context across parallel sessions without just dumping everything into one massive file?
sarkarsh 4 hours ago [-]
The ground truth file is the right idea but flat markdown breaks down fast once agents need to actually coordinate. They can't query it -- they just re-read everything.
We hit the same wall running parallel Claude Code sessions. Ended up switching to ctlsurf, which gives agents structured blocks via MCP -- task tables, key-value state, append-only logs. An agent can check "is anyone already on the auth module" by querying a table, not parsing prose. State survives across sessions too, so a new agent doesn't start from scratch.
The drift problem across parallel agents is exactly what convinced me flat files weren't enough.
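ctlsurf's actual interface aside (I haven't used it), the "query a table instead of parsing prose" idea can be sketched with plain sqlite, where a unique constraint makes module claims atomic:

```python
import sqlite3

# In real use this would be a shared file (e.g. state.db) that every agent
# session opens; :memory: keeps the sketch self-contained.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE claims (module TEXT PRIMARY KEY, agent TEXT)")

def claim(agent: str, module: str) -> bool:
    """Try to claim a module; False means another agent already holds it."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute("INSERT INTO claims VALUES (?, ?)", (module, agent))
        return True
    except sqlite3.IntegrityError:  # primary key collision = already claimed
        return False

print(claim("agent-1", "auth"))  # True: auth was free
print(claim("agent-2", "auth"))  # False: taken, no prose parsing needed
```

The same table doubles as the "who is on what" snapshot a new session reads at startup.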
tdaltonc 1 day ago [-]
This maps closely to something we've been exploring in our recent paper. The core issue is that flat context windows don't organize information scalably, so as agents work in parallel they lose track of which version of 'reality' applies to which component. We proposed NERDs (Networked Entity Representation Documents), Wikipedia-style docs that consolidate all info about a code entity (its state, relationships, recent changes) into a single navigable document, cross-linked with other documents, that any agent can read. The idea is that the shared memory is entity-centered rather than chronological. Might be relevant: https://www.techrxiv.org/users/1021468/articles/1381483-thin...
CloakHQ 13 hours ago [-]
The entity-centered framing is a really useful reframe. The chronological context window is kind of the wrong shape for this problem - agents care about "what is the current state of X" not "what happened in what order".
The NERDs approach reminds me of how we ended up thinking about it from the browser automation side too: the thing you actually need to share isn't history, it's a live description of each component's current state and its dependencies. We got there through trial and error but it's interesting to see the same insight in a formal model.
Reading the paper now - curious how you handle NERDs for components that change state frequently at runtime (vs design-time decisions). Browser sessions are a good example - the "entity" changes with every request.
oceanic 1 day ago [-]
I’ve been using Steve Yegge’s Beads[1] lightweight issue tracker for this type of multi-agent context tracking.
I only run a couple of agents at a time, but with Beads you can create issues, then agents can assign them to themselves, etc. Agents or the human driver can also add context in epics, and I think you can have perpetual issues which contain context too. Or you could make them a type of issue yourself; it’s a very flexible system.
Beads has been on my list to try. I can see it being a natural evolution of my setup.
schipperai 1 day ago [-]
I avoid this with one spec = one agent, with worktrees if there is a chance of code clashing. Not ideal for parallelism though.
CloakHQ 1 day ago [-]
The worktree approach is interesting - keeps the filesystem separation clean. The parallelism tradeoff makes sense if the tasks are truly independent, which in practice is most of the time anyway.
What does your spec file look like when you kick off a new agent? Curious if you start from scratch each time or carry over context from previous sessions on the same project.
schipperai 1 day ago [-]
I describe this in the article - I mostly kick off a new agent per spec, both for Planners and Workers. I do tend to run /fd-explore before I start work on a given spec to give the agent context on the codebase and recent previous work.
CloakHQ 13 hours ago [-]
The /fd-explore step makes sense - basically giving the agent a map before it starts navigating. We've ended up doing something similar, though less formal: just pointing the agent at a README and a directory listing before each session so it has some spatial orientation.
The Planners/Workers split is interesting too. Do you find the Worker agents stay within scope reliably, or do you still end up with surprises when they go off-plan?
briantakita 1 day ago [-]
I've been building agent-doc [1] to solve exactly this. Each parallel Claude Code session gets its own markdown document as the interface (e.g., tasks/plan.md, tasks/auth.md). The agent reads/writes to the document, and a snapshot-based diff system means each submit only processes what changed — comments are stripped, so you can annotate without triggering responses.
The routing layer uses tmux: `agent-doc claim`, `route`, `focus`, `layout` commands manage which pane owns which document, scoped to tmux windows. A JetBrains plugin lets you submit from the IDE with a hotkey — it finds the right pane and sends the skill command.
For context sync across agents, the key insight was: don't sync. Each agent owns one document with its own conversation history. The orchestration doc (plan.md) references feature docs but doesn't duplicate their content. When an agent finishes a feature, its key decisions get extracted into SPEC.md. The documents ARE the shared context — any agent can read any document.
It's been working well for running 4-6 parallel sessions across corky (email client), agent-doc itself, and a JetBrains plugin — all from one tmux window with window-scoped routing.
The "don't sync, own" model makes a lot of sense. We were thinking about it wrong - trying to push state out to a shared file, when the cleaner move is to pull it in on demand.
The SPEC.md as the extraction target after a feature is done is a nice touch. In our case the tricky part is that browser automation state is partly external - you have sessions, cookies, proxy assignments that live outside the codebase. So the "ground truth" we needed wasn't just about code decisions but about runtime state too. Ended up logging that separately.
Checking out agent-doc, the snapshot-based diff to avoid re-triggering on comments is clever. Does it handle cases where two agents edit the same doc around the same time, or is the ownership model strict enough that this doesn't come up?
briantakita 37 minutes ago [-]
The runtime state logging approach makes sense for browser automation — that's a domain where ground truth literally lives outside your repo. We have a similar dynamic with email state in corky (IMAP sync, draft queues). Same pattern: log the external state separately and let the document reference it.
On concurrent editing — it's handled at two levels:
*Ownership:* Each document is claimed by one tmux pane (one agent session). The routing layer prevents two agents from working the same doc simultaneously.
*3-way merge:* If I edit the document while the agent is mid-response, agent-doc detects the change on write-back and runs `git merge-file --diff3` — baseline (pre-commit), agent response, and my concurrent edits all merge. Non-overlapping changes merge cleanly; overlapping changes get conflict markers. Nothing is silently dropped.
The pre-submit git commit is the key — it creates an immutable baseline before the agent touches anything, so there's always a clean reference point for the merge.
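For reference, the `git merge-file` invocation described above looks roughly like this (file names are illustrative); `-p` prints the merged result instead of overwriting the first file in place:

```python
import subprocess

def three_way_merge(ours: str, base: str, theirs: str) -> list[str]:
    """Build the git merge-file command for a diff3-style 3-way merge.

    ours   = the doc with my concurrent edits
    base   = the pre-submit commit (the immutable baseline)
    theirs = the doc with the agent's response
    """
    return ["git", "merge-file", "--diff3", "-p", ours, base, theirs]

cmd = three_way_merge("doc.mine.md", "doc.base.md", "doc.agent.md")
# subprocess.run(cmd, capture_output=True, text=True)
# A positive exit status means that many conflict hunks were emitted.
print(" ".join(cmd))
```

Non-overlapping edits merge cleanly; overlapping ones come back with `<<<<<<<`/`|||||||`/`>>>>>>>` markers, so nothing is silently dropped.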
jasonjmcghee 1 day ago [-]
I certainly don't run 6 at a time, but even with just 1 - if it's doing anything visual - how are folks hooking up screenshots to self verify? And how do you keep an eye on it?
The only solution I've seen on a Mac is doing it on a separate monitor.
I couldn't find a solution here and have built similar things in the past so I took a crack at it using CGVirtualDisplay.
Ended up adding a lot of productivity features and polished until it felt good.
Curious if there are similar solutions out there I just haven't seen.
For macOS, generically, you can run `screencapture -o -l $WINDOW_ID output.png` to screenshot any window. You can list window IDs belonging to a PID with a few lines of Swift (that any agent will generate). Hook this up together and give it as a tool to your agents.
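A sketch of wiring that up as an agent tool in Python; the window-ID lookup still needs the CoreGraphics call mentioned above, so here the ID is just assumed to be known:

```python
import subprocess

def screenshot_window(window_id: int, out_path: str) -> list[str]:
    """Build the macOS screencapture command for one window.

    -o omits the window shadow, -l targets a window by its CG window ID.
    """
    return ["screencapture", "-o", "-l", str(window_id), out_path]

cmd = screenshot_window(4721, "output.png")  # 4721 is a made-up window ID
# subprocess.run(cmd, check=True)  # macOS only; hand output.png back to the agent
print(" ".join(cmd))
```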
jasonjmcghee 9 hours ago [-]
And the compositor renders it unoccluded for the screenshot?
danbala 24 hours ago [-]
For anything web related, simply use the Chrome Claude plugin. Then Claude Code can control the browser (and 'see' what's showing).
jlongo78 20 hours ago [-]
The key insight with tmux agent parallelism is giving each pane a dedicated markdown spec file rather than sharing context. Agents stay focused and you avoid prompt contamination across sessions. Name your windows after the spec, not the task, so you can resume cold sessions without re-reading logs. Also worth adding a status pane that tails a shared cost log, otherwise parallel runs burn budget fast without you noticing.
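The shared cost log can be as simple as a JSONL file each pane appends to; a hypothetical sketch of both sides (the file name and fields are made up):

```python
import json
from collections import defaultdict

def log_cost(path: str, spec: str, usd: float) -> None:
    """Each agent pane appends one line per turn to the shared log."""
    with open(path, "a") as f:
        f.write(json.dumps({"spec": spec, "usd": usd}) + "\n")

def totals(path: str) -> dict[str, float]:
    """What the status pane renders: spend per spec file."""
    spend = defaultdict(float)
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            spend[entry["spec"]] += entry["usd"]
    return dict(spend)
```

The status pane can then just `watch` a tiny script that prints `totals("costs.jsonl")` next to the running panes.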
gas9S9zw3P9c 1 day ago [-]
I'd love to see what is being achieved by these massive parallel agent approaches. If it's so much more productive, where is all the great software that's being built with it? What is the OP building?
Most of what I'm seeing is AI influencers promoting their shovels.
TheCowboy 22 hours ago [-]
> If it's so much more productive, where is all the great software that's being built with it?
This is such a new and emerging area that I don't understand how this is a constructive comment on any level.
You can be skeptical of the technology in good faith, but I think one shouldn't be against people being curious and engaging in experimentation. A lot of us are actively trying to see what exactly we can build with this, and I'm not an AI influencer by any means. How do we find out without trying?
I feel like we're still at a "building tools to build tools" stage in multi-agent coding. A lot of interesting projects are springing up to see if they can get many agents to effectively coordinate on a project. If anything, it would be useful to understand what failed and why, so one can have an informed opinion.
tedeh 22 hours ago [-]
I don't think it is unreasonable to ask where all the great AI-built software is. There have been comments here on HN about people becoming 30 to 50 times more productive than before.
To put a statement like that into perspective (50 times more productive): The first week of the year about as much was accomplished as the whole previous year put together.
theshrike79 16 hours ago [-]
I haven't made any "great" software ever in my life. With AI or without.
But with AI assistance I've made SO MANY "useful", "handy" and "nifty" tools that I would've never bothered to spend the time on.
Like just last night I had Claude make a shell script on a whim that lets me use fzf to choose a running tmux session - with a preview of what the session's screen looks like.
Could I make it by hand? Yep. Would I have bothered? Most likely no.
Now it got done and iterated on my second monitor while I was watching 21 Bridges on my main monitor and eating snacks. (Chadwick Boseman was great in it)
sofal 20 hours ago [-]
I'd question your assumption that the software would be "great". I think we're seeing the volume of software increase faster than before. The average quality of the total volume of software will almost certainly decrease. It's not a contradiction for productivity in that respect to increase while quality decreases.
wiseowise 14 hours ago [-]
Well, if your produced value was 0 in the first place, multiplying that by a hundred will still be zero. The best example of that is claws: a lot of hype but just vapor, a twitter fart at best.
TheCowboy 21 hours ago [-]
I'm honestly not a big fan of when people throw out numbers implying a high degree of rigor without actually showing me evidence so I can judge for myself. If you're this much more productive, then use some % of that newly discovered productivity to show us.
But building software does tend to come with a lag even with AI. And we're also just more likely to see its influence in existing software first.
I'd rather be asking where it is AND actively trying to explore this space so I have a better grasp of the engineering challenges. I think there's just too many interesting things happening to be able to just wave it off.
mycall 21 hours ago [-]
The hard part about extracting patterns right now is that they shift every 2-4 months (it was every 6-12 months in 2024-2025). What works for you today might be obsolete in May.
jjmarr 1 day ago [-]
I just avoided $1.8 million/year in review time w/ parallel agents for a code review workflow.
We have 500+ custom rules that are context sensitive because I work on a large and performance sensitive C++ codebase with cooperative multitasking. Many things that are good are non-intuitive and commercial code review tools don't get 100% coverage of the rules. This took a lot of senior engineering time to review.
Anyways, I set up a massive parallel agent infrastructure in CI that chunks the review guidelines into tickets, adds them to a queue, and has agents spit out GitHub code review comments. Then a manager agent validates the comments/suggestions using scripts and posts the review. Since these are coding agents, they can autonomously gather context or run code to validate their suggestions.
Instantly reduced mean time to merge by 20% in an A/B test. Assuming 50% of time on review, my org would've needed 285 more review hours a week for the same effect. Super high signal as well, it catches far more than any human can and never gets tired.
Likewise, we can scale this to any arbitrary review task, so I'm looking at adding benchmarking and performance tuning suggestions for menial profiling tasks like "what data structure should I use".
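The chunking step described above is the easy part; a sketch of slicing the guidelines into per-agent queue tickets (the batch size and naming are made up):

```python
def make_tickets(rules: list[str], per_ticket: int = 25) -> list[dict]:
    """Slice the rule list into queue tickets, one review agent per ticket."""
    return [
        {"ticket": i // per_ticket, "rules": rules[i:i + per_ticket]}
        for i in range(0, len(rules), per_ticket)
    ]

# 500+ context-sensitive rules become ~20 tickets that parallel agents
# pull off the queue; the manager agent validates their comments afterwards.
tickets = make_tickets([f"rule-{n}" for n in range(512)])
print(len(tickets), len(tickets[0]["rules"]))  # 21 25
```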
theshrike79 14 hours ago [-]
This is what Google uses in their internal review systems - at least their AI team does this.
Heard a presentation from one of their AI engineers where they had a few slides about using multi-agent systems with different focuses looking through the code before a single human is pinged to look at the pull request.
jjmarr 4 hours ago [-]
If only I could get into a Google AI team!
Unfortunately I didn't graduate from Waterloo nor did I have referrals last year, so Google autorejects me from even forward deployed engineer roles without even giving me an OA.
Instead I get to maintain this myself for several hundred developers as a junior and get all my guidance from HN.
sarchertech 22 hours ago [-]
>$1.8 million
That sounds like a completely made up bullshit number that a junior engineer would put on a resume. There’s absolutely no way you have enough data to state that with anything approaching the confidence you just did.
jjmarr 21 hours ago [-]
It's definitely a resume number I calculated as a junior engineer. Feel free to give feedback on my math.
It is based on $125/hr and it assumes review time is inversely proportional to number of review hours.
Then time to merge can be modelled as
T_total = T_fixed + T_review
where fixed time is stuff like CI. For the sake of this T_fixed = T_review i.e. 50% of time is spent in review. (If 100% of time is spent in review it's more like $800k so I'm being optimistic)
T_review is proportional to 1/(review hours).
We know the T_total has been reduced by 23.4% in an A/B test, roughly, due to this AI tool, so I calculate how much equivalent human reviewer time would've been needed to get the same result under the above assumptions. This creates the following system of equations:
T_total_new = T_fixed + T_review_new
T_total_new = T_total * (1 - r)
where r = 23.4%. This simplifies to:
T_review_new = T_review - r * T_total
since T_review / T_review_new = capacity_new / capacity_old (because inverse proportionality assumption). Call this capacity ratio `d`. Then d simplifies to:
d = 1/(1 - r/(T_review/T_total))
T_review/T_total is % of total review time spent on PR, so we call that `a` and get the expression:
d = 1 / (1 - r/a)
Then at 50% of total time spent on review a=0.5 and r = 0.234 as stated. Then capacity ratio is calculated at:
d ≈ 1.8797
Likewise, we have like 40 reviewers devoting 20% of a 40 hr workweek giving us 320 hours. Multiply by (d - 1) and get roughly 281.504 hours of additional time, or $35,188/week, which over 52 weeks is a little over $1.8 million/year.
Ofc I think we cost more than $125 once you consider health insurance and all that, likewise our reviewers are probably not doing 20% of their time consistently, but all of those would make my dollar value higher.
The most optimistic assumption I made is 50% of time spent on review.
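Plugging the stated numbers into the formulas above, as a quick check of the arithmetic:

```python
r = 0.234                      # measured reduction in total time to merge
a = 0.5                        # assumed fraction of total time spent in review
d = 1 / (1 - r / a)            # capacity ratio, d = 1/(1 - r/a)
extra_hours = 320 * (d - 1)    # 40 reviewers * 8 review-hours/week, scaled up
weekly_usd = extra_hours * 125
annual_usd = weekly_usd * 52

print(round(d, 4), round(extra_hours, 1), round(annual_usd))
# d ≈ 1.8797, ~281.5 extra hours/week, ~$1.83M/year
```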
sarchertech 21 hours ago [-]
The feedback is don’t put it on a resume because it looks ridiculous. I can almost guarantee you that an A/B test design wasn’t rigorous enough for you to be that confident in your numbers.
But even if that is correct you need a much longer time frame to tell if reviews using this new tool are equivalent as a quality control measure.
And you have so many assumptions built in to this that are your number is worthless. You aren’t controlling for all the variables you need to control for. How do you know that workers spend 8 hours a week on reviews vs spending 2 hours and slacking off the other 6 hours? How do you know that the change of process created by using this tool doesn’t just cause the reviewers to work harder, but they’ll stop doing that once the novelty wears off? What if reviewers start relying on this tool to catch a certain class of errors for which it has low sensitivity?
It’s also a moot point if they don’t actually end up saving the money you say they will. It could be that all the savings are eaten up because the reviewers just use the extra time to dick around on Hacker News. It could just be that people aren’t able to make productive use of their time saved. Maybe they were already maxing out their time doing other useful activities.
All of this screams junior engineer took very limited results and extrapolated to say “saved the company millions” without nearly enough supporting evidence. Run your tool for 6 months, take an actual business outcome like time to merge PRs, measure that, and put that on your resume.
It’s incredibly common for a junior engineer to create some new tooling, and come up with some numbers to justify how this new tooling saves the company millions in labor. I have never once seen these “savings” actually pan out.
jjmarr 20 hours ago [-]
I took it off LinkedIn and replaced with time to merge reduction of 20% over two weeks of PRs (rounding down). I expect to justify the expenditure to non-technical managers in my current role, which is why I picked $s.
> All of this screams junior engineer took very limited results and extrapolated to say “saved the company millions” without nearly enough supporting evidence.
That's what the only person in my major who got a job at FAANG in California did, which is why I borrowed the strategy since it seems to work.
> I can almost guarantee you that an A/B test design wasn’t rigorous enough for you to be that confident in your numbers.
Shoot me an email about methodology! It's my username at gmail. I'd be happy to get more mentorship about more rigorous strategies and I can respond to concerns in less of a PR voice.
ecliptik 1 days ago [-]
It's for personal use, and I wouldn't call it great software, but I used Claude Code Teams in parallel to create a Fluxbox-compatible window compositor for Wayland [1].
Overall effort was a few days of agentic vibe-coding over a period of about 3 weeks. Would have been faster, but the parallel agents burn through tokens extremely quickly and hit Max plan limits in under an hour.
This is really cool. Out of curiosity did you know how to do this sort of programming prior to LLMs?
ecliptik 23 hours ago [-]
Not really, most of my programming experience is for devops/sysadmin scripts with shell/perl. I can read python/ruby from supporting application teams, but never programmed a large project or application with it myself. Last I used C was 25 years ago in some college courses and was never very good with it.
indigodaddy 1 days ago [-]
Pretty cool!
fhd2 1 days ago [-]
Even if somebody shows you what they've built with it, you're none the wiser. All you'll know is that it seemingly works well enough for a greenfield project.
The jury is still very far out on how agentic development affects mid/long term speed and quality. Those feedback cycles are measured in years, not weeks. If we bother to measure at all.
People in our field generally don't do what they know works, because by and large, nobody really knows, beyond personal experiences, and I guess a critical mass doesn't even really care. We do what we believe works. Programming is a pop culture.
suzzer99 1 days ago [-]
Does good design up front matter as much if an AI can refactor in a few hours something that would take a good developer a month? Refactoring is one of those tasks that's tedious, and too non-trivial for automation, but seems perfect for an AI. Especially if you already have all the tests.
qudat 23 hours ago [-]
I’m constantly using code agents to work on feature development and they are constantly getting things wrong. They can refactor high level concepts but I have to nudge them to think about the proper abstractions. I don’t see how a multiagent flow could handle those interactions. The bus factor is 1, me.
cloverich 18 hours ago [-]
Try building review skills based on how you review. I built one recently based on how I review some of the concurrent backend stuff one of our tools does. I have it auto-run on every PR. It's great, it catches tons of stuff, and ranks the issues by severity. Over 10 reviews, only 1 false positive (hallucination) and several critical catches. I wish I'd set it up sooner.
You can also, after those sessions where they get stuff wrong, ask for an analysis of what it got wrong that session and have it produce a ranked list. I just started that and wow, it comes up with pretty solid lists. I'm not sure if it's sustainable to simply consolidate and prune it, but maybe it is?
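For anyone who hasn't built one: a Claude Code skill is just a markdown file with frontmatter. A review skill along the lines described above might be structured roughly like this (the frontmatter fields are the standard ones; the content and name are invented):

```markdown
---
name: concurrency-review
description: Review PRs for concurrency hazards in the backend job runner
---

When reviewing a PR, check each changed file for:

1. Shared state mutated outside a lock or transaction
2. Await points inside critical sections
3. Retry logic that is not idempotent

For every finding, report file:line, a severity (critical/major/minor),
and a one-line suggested fix. End with a summary ranked by severity.
```

The key move is encoding *your* review heuristics, not generic "find bugs" instructions, so the skill catches the same class of issues you would.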
veilrap 1 days ago [-]
Upgrades, API compatibility, and cross version communication are really important in some domains. A bad design can cause huge pain downstream when you need to make a change.
Jensson 23 hours ago [-]
> Especially if you already have all the tests.
Most tests people write have to be changed if you refactor.
briantakita 1 days ago [-]
I am now releasing software for projects that have spent years on the back-burner. From my perspective, agent loops have been a success. It makes the impractical pipe-dream doable.
Nadya 1 days ago [-]
Yeah, I have a never-ending list of things I could easily make myself if I could set aside 7-10 hours to plan them out, develop, and troubleshoot, but that are also low-priority enough that they sit on the back burner perpetually.
Now these things are being made. I can justify spending 5-10 minutes on something without being upset if AI can't solve the problem yet.
And if not, I'll try again in 6 months. These aren't time sensitive problems to begin with or they wouldn't be rotting on the back burner in the first place.
sarchertech 1 days ago [-]
That’s completely ignoring the point of the person you are responding to. They weren’t talking about small greenfield projects.
briantakita 24 hours ago [-]
Agent loops also enable the "hard discipline" of making sure all of the tests are written, documentation is up to date, specs are explicitly documented, etc. Stuff that often gets dropped/deprioritized due to time pressure & exhaustion. Gains from automation apply to greenfield & complex legacy projects.
sarchertech 22 hours ago [-]
Well that’s more on topic as a response to the original poster. Still not really in keeping with the original thread question though of show me the beef.
echelon 1 days ago [-]
I'm using Claude Code (loving it) and haven't dipped into the agentic parallel worker stuff yet.
Where does one get started?
How do you manage multiple agents working in parallel on a single project? Surely not the same working directory tree, right? Copies? Different branches / PRs?
You can't use your Claude Code login and have to pay API prices, right? How expensive does it get?
ecliptik 1 days ago [-]
Check out Claude Code Team Orchestration [1].
Set an env var and ask to create a team. If you're running in tmux it will take over the session and spawn multiple agents all coordinated through a "manager" agent. Recommend running it sandboxed with --dangerously-skip-permissions, otherwise it's endless approvals.
Churns through tokens extremely quickly, so be mindful of your plan/budget.
git checkout four copies of your repo (repo, repo_2, repo_3, repo_4)
within each one open claude code
Works pretty well! With the $100 subscription I usually don't get limited in a day. A lot of thinking needs to go into giving it the right context (markdown specs in repo works for us)
Obv, work on things that don't affect each other, otherwise you'll be asking them to look across PRs and that's messy.
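Instead of four full checkouts, git worktrees give each agent an isolated working tree that shares one object store. A minimal sketch (paths and branch names are just examples, not from the thread):

```shell
# One worktree per agent so parallel Claude Code sessions never touch
# the same working tree. Uses a throwaway repo purely for illustration.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"
for i in 2 3 4; do
  # Each worktree gets its own branch; merging back happens via normal PRs.
  git -C "$repo" worktree add -q "${repo}_$i" -b "agent-$i"
done
git -C "$repo" worktree list   # main tree + three agent trees
```

Compared to plain copies, worktrees keep branches visible from every tree and avoid duplicating the .git history on disk.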
From personal experience, software developed with agents does not hit the road because:
a) learning and adapting is at first more effort, not less,
b) learning with experiments is faster,
c) experiencing the acceleration first hand is demoralising,
d) distribution/marketing is on an accelerated declining efficiency trajectory (if you want to keep it human-generated)
e) maintenance effort is not decelerating as fast as creation effort
Yet, I believe your statement is wrong in the first place. A lot of new code is created with AI assistance already, and part of the acceleration in AI itself can be attributed to increased use of AI in software engineering (from research to planning to execution).
schipperai 1 days ago [-]
I work for Snowflake and the code I'm building is internal. I'm exploring open sourcing my main project which I built with this system. I'd love to share it one day!
habinero 13 hours ago [-]
Oh god, please don't make me look for another data warehouse.
The long tail of deployable software always strikes at some point, and monetization is not the first thing I think of when I look at my personal backlog.
I also am a tmux+claude enjoyer, highly recommended.
digitalbase 1 days ago [-]
tmux too.
Trying workmux with claude. Really cool combo
hinkley 1 days ago [-]
I’ve known too many developers and seen their half-assed definition of Done-Done.
I actually had a manager once who would say Done-Done-Done. He’s clearly seen some shit too.
haolez 1 days ago [-]
The influencers generate noise, but the progress is still there. The real productivity gains will start showing up at market scale eventually.
onion2k 1 days ago [-]
I'm experimenting with building an agent swarm to take a very large existing app that's been built over the past two decades (internal to the company I work for) and reverse engineer documentation from the code so I can then use that documentation as the basis for my teams to refactor big chunks of old-no-longer-owned-by-anyone features and to build new features using AI better. The initial work to just build a large-scale understanding of exactly what we actually run in prod is a massively parallelizable task that should be a good fit for some documentation writing agents. Early days but so far my experiments seem to be working out.
Obviously no users will see a benefit directly but I reckon it'll speed up delivery of code a lot.
vishnugupta 20 hours ago [-]
> great software
Most software is mundane, run-of-the-mill CRUD feature sets. Just yesterday I rolled out 5 new web pages and revamped a landing page in under an hour, something that would have easily taken 3-4 days of back and forth.
There is a lot of similar coding happening.
This is the space where AI coding truly shines: repetitive work, all the wiring and routing around adding links, SEO elements and what not.
Either way, you can try to incorporate AI coding into your flow and see where it takes you.
linsomniac 1 days ago [-]
In my view, these agent teams have really only become mainstream in the last ~3 weeks since Claude Code released them. Before that they were out there but were much more niche, like in Factory or Ralphie Wiggum.
There is a component to this that keeps a lot of the software being built with these tools underground: There are a lot of very vocal people who are quick with downvotes and criticisms about things that have been built with the AI tooling, which wouldn't have been applied to the same result (or even poorer result) if generated by human.
This is largely why I haven't released one of the tools I've built for internal use: an easy status dashboard for operations people.
Things I've done with agent teams: Added a first-class ZFS backend to ganeti, rebuilt our "icebreaker" app that we use internally (largely to add special effects and make it more fun), built a "filesystem swiss army knife" for Ansible, converted a Lambda function that does image manipulation and watermarking from Pillow to pyvips, also had it build versions of it in go, rust, and zig for comparison sake, build tooling for regenerating our cache of watermarked images using new branding, have it connect to a pair of MS SQL test servers and identify why logshipping was broken between them, build an Ansible playbook to deploy a new AWS account, make a web app that does a simple video poker app (demo to show the local users group, someone there was asking how to get started with AI), having it brainstorm and build 3 versions of a crossword-themed daily puzzle (just to see what it'd come up with, my wife and I are enjoying TiledWords and I wanted to see what AI would come up with).
Those are the most memorable things I've used the agent teams to build in the last 3 weeks. Many of those things are internal tools or just toys, as another reply said. Some of those are publicly released or in progress for release. Most of these are in addition to my normal work, rather than as a part of it.
schipperai 1 days ago [-]
Further, my POV is that coding agents crossed a chasm only last December with the Opus 4.5 release. Only since then have these kinds of agent-team setups actually worked. It’s early days for agent orchestration.
gooob 1 days ago [-]
can you tell us about this "ansible filesystem swiss army knife"?
linsomniac 23 hours ago [-]
I'd be happy to! I find in my playbooks that it's fairly cumbersome to set up files and related resources because of the module distinction between copying files, rendering templates, creating directories... There's a lot of boilerplate that has to be repeated.
For 3-4 years I've been toying with this in various forms. The idea is a "fsbuilder" module that makes a task logically group filesystem setup (as opposed to grouping by operation as the ansible.builtin modules do).
You set up in the main part of the task the defaults (mode, owner/group, etc), then in your "loop" you list the fs components and any necessary overrides for the defaults. The simplest could for example be:
    - name: Set up app config
      linsomniac.fsbuilder.fsbuilder:
        dest: /etc/myapp.conf
Which defaults to a template with the source of "myapp.conf.j2". But you can also do more complex things like:
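For illustration, a more involved form extrapolated from the defaults-plus-loop description above might look something like this (hypothetical syntax; the actual module's parameters may differ):

```yaml
- name: Set up app filesystem
  linsomniac.fsbuilder.fsbuilder:
    owner: appuser        # defaults applied to every item in the loop
    group: appuser
    mode: "0644"
  loop:
    - dest: /etc/myapp.conf           # template, defaults to myapp.conf.j2
    - dest: /var/lib/myapp            # a directory, with a mode override
      state: directory
      mode: "0750"
    - dest: /usr/local/bin/myapp-run  # plain copy with an override
      src: myapp-run.sh
      mode: "0755"
```

The point is that one task describes the whole filesystem footprint of an app, instead of separate copy/template/file tasks each repeating owner/group/mode.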
You're not wrong. The current bottleneck is validation. If you use orchestration to ship faster, you have less time to validate what you're building, and the quality goes down.
If you have a really big test suite to build against, you can do more, but we're still a ways off from dark software factories being viable. I guessed ~3 years back in mid 2025 and people thought I was crazy at the time, but I think it's a safe time frame.
hombre_fatal 21 hours ago [-]
There are so many more iOS apps being published that it takes a week to get a dev account; review times are longer, and app volume is way up. It’s not really a thing you’re going to notice or not if you’re just going by vibes.
Reebz 1 days ago [-]
People are building for themselves. However I’d also reference www.Every.to
They built the popular compound-engineering plugin and have shipped a set of production grade consumer apps. They offer a monthly subscription and keep adding to that subscription by shipping more tools.
verdverm 1 days ago [-]
There are dozens and dozens of these submitted to Show HN, though increasingly without the title prefix now. This one doesn't seem any more interesting than the others.
schipperai 1 days ago [-]
I picked up a number of things from others sharing their setup. While I agree some aspects of these are repetitive (like using md files for planning), I do find useful things here and there.
calvinmorrison 1 days ago [-]
I wrote a cash-flow-tracking finance app in Qt6 using Claude and have been using it since Jan 1 to replace my old spreadsheets!
I built an Erlang-based chat server implementing a JMAP extension; Claude wrote the RFC and then wrote the server for it.
mrorigo 1 days ago [-]
Erlang FTW. I remember the days at the ol' lab!
calvinmorrison 1 days ago [-]
i have no use for it at my work, i wish i did, so i did this project for fun instead.
karel-3d 1 days ago [-]
look at Show HN. Half of it is vibe-coded now.
servercobra 1 days ago [-]
This is a really cool design, pretty similar to what I've built for implementation planning. I like how iterative it is and that the whole system lives just in markdown. The verify step is a great idea I hadn't made a command for yet, thank you!
This seems like it'd be great for solo projects but starts to fall apart for a team with a lot more PRs and distributed state. Heck, I run almost everything in a worktree, so even there the state is distributed. Maybe moving some of the state/plans/etc to Linear et al solves that though.
schipperai 1 days ago [-]
Thanks! I mainly work solo so I haven’t tested this setup in a shared project.
nferraz 1 days ago [-]
I liked the way you bootstrap the agent from a single markdown file.
schipperai 1 days ago [-]
I built so much muscle memory from the original system, so it made sense to apply it to other projects. This was the simplest way to achieve that
aceelric 1 days ago [-]
I’ve been experimenting with a similar pattern but wrapping it in a “factory mode” abstraction (we’re building this at CAS[1]) where you define the spec once after careful planning with a supervisor agent, then you let it go and spin up parallel workers against it automatically. It handles task decomposition + orchestration so you’re not manually juggling tmux panes.
I think we need much different tooling to go beyond a 1 human : 10 agents ratio. And much, much different tooling to achieve a higher ratio than that.
Scea91 1 days ago [-]
I don't think number of parallel agents is the right productivity metric, or at least you need to account for agent efficiency.
Imagine a superhuman agent who does not need to run in endless loops. It could generate 100k line code-base in a few minutes or solve smaller features in seconds.
In a way, the inefficiency is what leads people to parallelism. There is only room for it because the agents are slow, perhaps the more inefficient and slower the individual agents are, the more parallel we can be.
sluongng 15 hours ago [-]
Yeah, I don't disagree with your assessment at all. I think the H2A ratio is still a good metric for the AI adoption rate of an organization. At a higher H2A ratio, you will also start to hear people measuring things using token volumes, which I think is also a similar metric (because most models nowadays run on a relatively fixed Tokens/second speed).
All of this is not a direct signal to a productivity boost. I think at higher volumes, you will need to start to account for the "yield" rate of the token volumes above: what are the volumes of tokens that get to the final production deployment? At which stage is it a constraint on the yield? Is it the models, or is it the harness, or something else (i.e. Code Review, CI/CD, Security Scans etc...)? And then it becomes an optimization problem to reduce the Cost of Goods Sold while improving/maintaining Revenues. The "productivity" will then be dissolved into multiple separate but more tangible metrics.
schipperai 1 days ago [-]
A few experiments like Gas Town, the compiler from Anthropic, or the browser from Cursor managed to reach the Rocket stage, though in their reports the jagged intelligence of the LLMs was eerily apparent. Do you think we also need better models?
sluongng 1 days ago [-]
I do. The reason the current generation of agents is good at coding is that the labs have had sufficient time and compute to generate synthetic chain-of-thought data and feed it through RL before using it to train the LLMs. This distillation takes time, time which starts from the release of the previous generation of models.
So we are just now getting agents which can reliably loop themselves for medium-size tasks. This generation opens a new door towards agents-managing-agents chain-of-thought data. I think we will only get multi-agents with high reliability sometime by mid-to-late 2026, assuming no major geopolitical disruption.
I love this article. I learned a lot from the OP’s setup, although the tools in my setup are basically the same. I like using vanilla tmux with visual changes. I also use a bash script to manage git worktrees. I have a few slash commands (now skills with no auto-invocation) for my workflow.
At the end of the day, I think that it all comes down to building what works for you. But at this point there is no doubt AI will play an important role to speed up workflows and augment one’s capacity.
schipperai 12 hours ago [-]
Thanks! Glad you enjoyed it and found it helpful.
I agree there is no one size fits all (yet). I have looked into a lot of orchestrators and none so far have fit my needs. I prefer my customized simple setup.
hinkley 1 days ago [-]
These setups pretty much require the top tier subscription, right?
0x457 1 days ago [-]
Even on Claude Max x1, if you run 2 agents with Opus in parallel you're going to hit limits. You can balance the model per use case though, but I wouldn't expect it to work on any $20 plan even if you use Kimi Code.
schipperai 1 days ago [-]
That's a yes from my side.
etyhhgfff 1 days ago [-]
Is one $200 plan sufficient to run 8x Claude Code with Opus 4.6? Or what else do you need in terms of subscriptions?
gck1 1 days ago [-]
No. I run a similar setup and with $200 subscription, I usually hit weekly quota by around day 3-4. My approach is 4-5 hours of extreme human in the loop spec sessions with opus and codex:
1. We discuss every question with opus, and we ask for second opinion from codex (just a skill that teaches claude how to call codex) where even I'm not sure what's the right approach
2. When context window reaches ~120k tokens, I ask opus to update the relevant spec files.
3. Repeat until all 3 of us - me, opus and codex - are happy or are starting to discuss nitpicks and YAGNIs, whichever comes first.
Then it's fully autonomous until all agents are happy.
Which is why I'm exploring optimization strategies. Based on the analysis of where most of the tokens are spent for my workflow, roughly 40% of it is thinking tokens with "hmm not sure, maybe..", 30% is code files.
So two approaches:
1. Have a cheap supervisor agent that detects that claude is unsure about something (which means spec gap) and alerts me so that I can step in
2. "Oracle" agent that keeps relevant parts of codebase in context and can answer questions from builder agents.
And also delegating some work to cheaper models like GLM where top performance isn't necessary.
You'll notice that as soon as you reach a setup you like that actually works, $200 subscription quotas will become a limiting factor.
hinkley 1 days ago [-]
That does seem to argue for the checkpointing strategy of having the agent explain their plan and then work on it incrementally. When you run out of tokens you either switch projects until your quota recovers or you proceed by hand until the quota recovers.
I also kinda expect that one of the saner parts of agentic development is the skills system, that skills can be completely deterministic, and that after the Trough of Disillusionment people will be using skills a lot more and AI a lot less.
gck1 1 days ago [-]
Yes on both counts. Implementation plan is a second layer after the spec is written, at which point, spec can't be changed by agents. I then launch a planner agent that writes a phased plan file and each builder can only work on a single phase from that file.
So it's spec (human in the loop) > plan > build. Then it cycles autonomously in plan > build until spec goals are achieved. This orchestration is all managed by a simple shell script.
But even with the implementation plan file, a new agent has to orient itself and load files it may later decide were irrelevant; the plan may not have been completely correct, there could have been gaps, initial assumptions may not hold, etc. It then starts eating tokens.
And it feels like this can be optimized further.
And yes on deterministic tooling as well.
gck1 1 days ago [-]
More like 2x$200 plans.
hinkley 24 hours ago [-]
So that's kind of a non-starter for self-directed learning.
kledru 1 days ago [-]
I think you should have a reviewer as well.
schipperai 1 days ago [-]
I have /fd-verify which I execute with the Worker after it’s done implementing. I didn’t feel the need to have a separate window / agent for reviewing. The same Worker can review its own code. What would be the benefits of having a separate Reviewer?
kledru 1 days ago [-]
ok -- I am currently quite impressed with a dedicated verifier that has a large degree of freedom (very simple prompt). At least when it comes to backend work.
kledru 1 days ago [-]
sorry, reviewer. GitHub issues are used by the implementer and reviewer for back-and-forth.
zwilderrr 1 days ago [-]
I just can’t get over the fact that your Anglicized name sounds like manual shipper.
schipperai 1 days ago [-]
it is ironic
philipp-gayret 1 days ago [-]
Is there a place where people like you go to share ideas around these new ways of working, other than HN? I'm very curious how these new ways of working will develop. In my system, I use voice memos to capture thoughts and they become more or less what you have as feature designs. I notice I have a lot of ideas throughout the day (Claude chews through them some time later, and when they are worked out I review its plans in Notion; I use Notion because I can upload memos into it from my phone, so it's more or less what you call the index). But ideas.. I can only capture them as they come, otherwise they are lost & I don't want to spend time typing them out.
schipperai 1 days ago [-]
I have only seen similar posts in HN or X. I’d be curious if there are more.
renewiltord 16 hours ago [-]
I don’t find mapping humans to agents worthwhile. Claude produces machine agents of weird structure that are aware of some fractal subfraction of the code, and this works well.
Regardless, the one thing that I do find useful is a markdown task list because this survives context damage. This is a harness workaround that I fully anticipate will be dealt with in Claude Code itself.
The agent wakes up every 2 hours with zero memory (no persistent context), reads its own STATE.md to understand what past iterations did, evaluates what's working, and acts. The STATE.md is simultaneously the spec AND the execution log.
What I've found: the biggest failure mode isn't the agent making bad decisions — it's the agent rediscovering its own architecture every run. Each time it reads STATE.md, it's essentially a new agent with inherited knowledge. This creates an interesting pressure to write STATE.md entries that are actually useful to a future self, not just status updates.
The spec-driven approach here solves the same problem from the other direction: the spec gives the agent enough context to act without needing to reconstruct context from scratch.
The tmux session management also resonates — we use a LaunchAgent plist for the same reason: consistent environment with keychain access, no cron auth issues.
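To make that concrete, an entry format aimed at a future zero-memory self might look like this (illustrative only, not the actual file):

```markdown
## Iteration 212 (2026-02-03 14:00 UTC)
Decision: switched polling to webhooks. Why: polling burned a large share
of the token budget rediscovering unchanged state every wake-up.
Working: webhook listener, daily digest. Broken: retry queue.
Next: make the retry queue idempotent before touching anything else.
Dead ends: don't retry the cron approach; keychain access fails headless.
```

The "Why" and "Dead ends" lines are what separate a useful entry from a status update: they are exactly the context a fresh run cannot re-derive on its own.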
Live experiment if curious: https://frog03-20494.wykr.es
- Base Claude Code (released)
- Extensive, self-orchestrated, local specs & documentation; i.e. waterfall for many features / longer-term project goals (summer)
- Base Claude Code (today)
Claude Code is getting better at orchestrating its own subagents for divide-and-conquer type work.
My problem with these extensive self-orchestrated multi-agent / spec modes is the kind of drift and rot across all the changes and integrated parts of an application that a lot of the time end up in merge conflicts. Aside from my own decision cognitive space, it's also a lot to just generally orchestrate and review. I spent a ton of time enforcing Claude to use the system I put in place, including documentation updates and continuous logging of work.
I feel extremely productive with a single Claude Code for a project. Maybe for minor features, I'll launch Claude Code in the web so that it can operate in an isolated space to knock them out and create a PR.
I will plan and annotate extensively for large features, but not many features or broad project specs all at the same time. Annotation and better planning UX, I think, are going to be increasingly important for now. The only augment of Claude Code I have is a hook for plan mode review: https://github.com/backnotprop/plannotator
I have between 3 and 6 hours per day where I can sit in front of a laptop and work directly with the code. The more of the actual technical planning/coding/testing/bug fixing loop I can get done in that time the better. If I can send out multiple agents to implement a plan I wrote yesterday, while another agent fixes lint and type errors, a third agent or two or three are working with me on either brainstorming or new plans, that's great! I'll go out for a walk in the park and think deeply during the rest of the day.
When people hear about all of these agent - working on three plans at once, really? - it sounds overwhelming. But realistically there's a lot of downtime from both sides. I ask the agent a question, it spends 5-10 minutes exploring. During that time I check on another agent or read some code that has been generated, or do some research of my own. Then I'll switch back to that terminal when I'm ready and ask a follow up question, mark the plan as ready, or whatever.
The worst thing I did when I was first getting excited about how agents were good now, a whole two months ago, was set things up so I could run a terminal on my phone and start sessions there. That really did destroy my deep thinking time, and lasted for about 3 days before I deleted termux.
- Research
- Scan the web
- Text friends
- Side projects
- Take walks outside
etc
Any ideas?
I'm mostly sticking to a codex workflow. I transitioned from the CLI to their app when they released it a few weeks ago and I'm pretty happy with that. I've had to order extra tokens a few times but most weeks I get by on the $20 ChatGPT Plus subscription. That's not really compatible with burning hundreds/thousands on lots of parallel agents in any case.
I also have a hunch that there are some fast diminishing returns on that kind of spending. At least, I seem to get a lot of value out of just spending 20/month. A lot of that more extreme burn might just be tool churn / inefficiency.
With teams, basically you should organize around CI/CD, pull requests and having code reviews (with or without AI assists). Standard stuff; you should be doing that anyway. But doubling down on making this process fast and efficient pays off. With LLMs the addition to this would be codifying/documenting key skills in your repositories for doing stuff with your code base and ways of working. A key thing in teams is to own and iterate on that stuff and not let it just rot. PRs against that should be well reviewed and coordinated and not just sneaked in.
Otherwise, AI usage just increases the volume of PRs and changes. Most of these tools in any case work a lot better if you have a good harness around your workflow that allows it to run linting/tests, etc. If you have good CI, this shouldn't be hard to express in skill form. The issue then becomes making sure the team gets good at producing high quality PRs and processing them efficiently. If you are dealing with a lot of conflicts, PR scope creep, etc. that's probably not optimal.
A lot of stuff related to coordinating via issue trackers can also be done with agents. If you have gh cli set up, it can actually create, label, etc. or act on github issues. That opens the door to also using LLMs for broader product management. It's something I've been meaning to experiment with more. But for bigger teams that could be something to lean on more. LLMs filing lots of issues is only helpful if you have the means to stay on top of that. That requires workflows where a lot of issues are short lived (time to some kind of resolution). This is not something many teams are good at currently.
Ended up flipping the model — instead of blocking bad actions, require proof of safety before any action runs. No proof, no action. Much harder to route around.
Curious if you've tried anything similar.
For example: 1) If it wants to delete a file, it has to output the exact path it thinks it’s deleting. I normalize it and make sure it’s inside the project root. If not, I block it. 2) If it proposes a big change, I require a diff first instead of letting it execute directly. 3) After code changes, I run tests or at least a lint/type check before accepting it.
So it’s less about formal proofs and more about forcing the agent to surface assumptions in a structured way, then verifying those assumptions mechanically.
Still hacky, but it reduced the “creative workaround” behavior a lot.
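For what it's worth, the path check in point (1) can be just a few lines. A sketch (the function name and layout are mine, nothing framework-specific):

```python
from pathlib import Path

def is_safe_delete(proposed: str, project_root: str) -> bool:
    """Block deletes that resolve outside the project root.

    resolve() normalizes '..' segments and symlinks, so a path like
    '../../etc/passwd' is caught even though the string starts out
    relative to the project.
    """
    root = Path(project_root).resolve()
    target = (Path(project_root) / proposed).resolve()
    return root in target.parents

print(is_safe_delete("src/old_module.py", "/home/me/proj"))  # True
print(is_safe_delete("../../etc/passwd", "/home/me/proj"))   # False
```

The key detail is resolving *before* comparing; a naive string-prefix check on the raw path is exactly the kind of thing an agent routes around.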
I recently added a snippet asking Claude not to try to bypass the deny list. I haven't had an incident since, but I'm still nervous... Claude once bypassed the deny list and nuked an important untracked directory, which caused me a lot of trouble.
It's like OpenClaw for me — I love the idea of agentic computer use, but I just don't see how something so unsupervised and unsupervisable is remotely a useful or good idea.
The spec file helps, but we found we also needed a short shared "ground truth" file the agents could read before taking any action - basically a live snapshot of what's actually done vs what the spec says. Without it, two agents would sometimes solve the same problem in incompatible ways.
Has anyone found a clean way to sync context across parallel sessions without just dumping everything into one massive file?
We hit the same wall running parallel Claude Code sessions. Ended up switching to ctlsurf, which gives agents structured blocks via MCP -- task tables, key-value state, append-only logs. An agent can check "is anyone already on the auth module" by querying a table, not parsing prose. State survives across sessions too, so a new agent doesn't start from scratch.
The drift problem across parallel agents is exactly what convinced me flat files weren't enough.
The NERDs approach reminds me of how we ended up thinking about it from the browser automation side too: the thing you actually need to share isn't history, it's a live description of each component's current state and its dependencies. We got there through trial and error but it's interesting to see the same insight in a formal model.
Reading the paper now - curious how you handle NERDs for components that change state frequently at runtime (vs design-time decisions). Browser sessions are a good example - the "entity" changes with every request.
I only run a couple of agents at a time, but with Beads you can create issues, then agents can assign them to themselves, etc. Agents or the human driver can also add context in epics, and I think you can have perpetual issues which contain context too. Or could make them as a type of issue yourself, it’s a very flexible system.
[1] https://github.com/steveyegge/beads
What does your spec file look like when you kick off a new agent? Curious if you start from scratch each time or carry over context from previous sessions on the same project.
The Planners/Workers split is interesting too. Do you find the Worker agents stay within scope reliably, or do you still end up with surprises when they go off-plan?
The routing layer uses tmux: `agent-doc claim`, `route`, `focus`, `layout` commands manage which pane owns which document, scoped to tmux windows. A JetBrains plugin lets you submit from the IDE with a hotkey — it finds the right pane and sends the skill command.
For context sync across agents, the key insight was: don't sync. Each agent owns one document with its own conversation history. The orchestration doc (plan.md) references feature docs but doesn't duplicate their content. When an agent finishes a feature, its key decisions get extracted into SPEC.md. The documents ARE the shared context — any agent can read any document.
It's been working well for running 4-6 parallel sessions across corky (email client), agent-doc itself, and a JetBrains plugin — all from one tmux window with window-scoped routing.
[1] https://github.com/btakita/agent-doc
The SPEC.md as the extraction target after a feature is done is a nice touch. In our case the tricky part is that browser automation state is partly external - you have sessions, cookies, proxy assignments that live outside the codebase. So the "ground truth" we needed wasn't just about code decisions but about runtime state too. Ended up logging that separately.
Checking out agent-doc, the snapshot-based diff to avoid re-triggering on comments is clever. Does it handle cases where two agents edit the same doc around the same time, or is the ownership model strict enough that this doesn't come up?
On concurrent editing — it's handled at two levels:
*Ownership:* Each document is claimed by one tmux pane (one agent session). The routing layer prevents two agents from working the same doc simultaneously.
*3-way merge:* If I edit the document while the agent is mid-response, agent-doc detects the change on write-back and runs `git merge-file --diff3` — baseline (pre-commit), agent response, and my concurrent edits all merge. Non-overlapping changes merge cleanly; overlapping changes get conflict markers. Nothing is silently dropped.
The pre-submit git commit is the key — it creates an immutable baseline before the agent touches anything, so there's always a clean reference point for the merge.
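The merge step itself is plain git and easy to reproduce outside the tool. A standalone sketch (the file contents are toy examples, and `three_way_merge` is my own wrapper, not part of agent-doc):

```python
import os
import subprocess
import tempfile

def three_way_merge(mine: str, baseline: str, agent: str) -> str:
    """Merge the agent's response with my concurrent edits against a
    shared baseline, via `git merge-file --diff3` (edits 'mine' in place;
    --diff3 includes the baseline section inside any conflict markers)."""
    with tempfile.TemporaryDirectory() as d:
        paths = {}
        for name, text in [("mine", mine), ("base", baseline), ("agent", agent)]:
            p = os.path.join(d, name)
            with open(p, "w") as f:
                f.write(text)
            paths[name] = p
        subprocess.run(
            ["git", "merge-file", "--diff3",
             paths["mine"], paths["base"], paths["agent"]],
            check=False,  # non-zero exit = conflict count, not a failure
        )
        with open(paths["mine"]) as f:
            return f.read()

base = "line1\nline2\nline3\n"
mine = "line1 my-edit\nline2\nline3\n"        # I touched line 1
agent = "line1\nline2\nline3 agent-edit\n"    # the agent touched line 3
print(three_way_merge(mine, base, agent))     # both edits, no conflict markers
```

Non-overlapping edits merge cleanly, as described above; overlapping ones come back wrapped in `<<<<<<<`/`>>>>>>>` markers so nothing is silently dropped.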
The only solution I've seen on a Mac is doing it on a separate monitor.
I couldn't find a solution here and have built similar things in the past so I took a crack at it using CGVirtualDisplay.
Ended up adding a lot of productivity features and polishing it until it felt good.
Curious if there are similar solutions out there I just haven't seen.
https://github.com/jasonjmcghee/orcv
Most of what I'm seeing is AI influencers promoting their shovels.
This is such a new and emerging area that I don't understand how this is a constructive comment on any level.
You can be skeptical of the technology in good faith, but I think one shouldn't be against people being curious and engaging in experimentation. A lot of us are actively trying to see what exactly we can build with this, and I'm not an AI influencer by any means. How do we find out without trying?
I still feel like we're still at a "building tools to build tools" stage in multi-agent coding. A lot of interesting projects springing up to see if they can get many agents to effectively coordinate on a project. If anything, it would be useful to understand what failed and why so one can have an informed opinion.
To put a statement like that into perspective (50 times more productive): The first week of the year about as much was accomplished as the whole previous year put together.
But with AI assistance I've made SO MANY "useful", "handy" and "nifty" tools that I would've never bothered to spend the time on.
Like just last night I had Claude make a shell script on a whim that lets me use fzf to choose a running tmux session - with a preview of what the session's screen looks like.
Could I make it by hand? Yep. Would I have bothered? Most likely no.
Now it got done and iterated on my second monitor while I was watching 21 Bridges on my main monitor and eating snacks. (Chadwick Boseman was great in it)
But building software does tend to come with a lag even with AI. And we're also just more likely to see its influence in existing software first.
I'd rather be asking where it is AND actively trying to explore this space so I have a better grasp of the engineering challenges. I think there's just too many interesting things happening to be able to just wave it off.
We have 500+ custom rules that are context sensitive because I work on a large and performance sensitive C++ codebase with cooperative multitasking. Many things that are good are non-intuitive and commercial code review tools don't get 100% coverage of the rules. This took a lot of senior engineering time to review.
Anyways, I set up a massively parallel agent infrastructure in CI that chunks the review guidelines into tickets, adds them to a queue, and has agents spit out GitHub code review comments. Then a manager agent validates the comments/suggestions using scripts and posts the review. Since these are coding agents, they can autonomously gather context or run code to validate their suggestions.
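The chunking step in a pipeline like that could be sketched as follows (the chunk size and names are my own illustration, not the poster's actual setup):

```python
from collections import deque

def chunk_guidelines(rules: list[str], rules_per_ticket: int = 25) -> deque:
    """Split a big rule list into per-agent review tickets.

    Each ticket gets a small slice of the guidelines so one worker
    agent's context stays focused; a manager agent later validates
    the comments each worker posts before publishing the review.
    """
    queue = deque()
    for i in range(0, len(rules), rules_per_ticket):
        queue.append({
            "ticket_id": i // rules_per_ticket,
            "rules": rules[i:i + rules_per_ticket],
        })
    return queue

# 500+ rules -> ~20 focused tickets of 25 rules each
rules = [f"rule-{n}" for n in range(510)]
tickets = chunk_guidelines(rules)
print(len(tickets))  # 21 (510 rules / 25 per ticket, rounded up)
```

The point of chunking is that a single agent reviewing against 500+ context-sensitive rules dilutes attention; 25 rules per worker keeps each pass sharp.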
Instantly reduced mean time to merge by 20% in an A/B test. Assuming 50% of time on review, my org would've needed 285 more review hours a week for the same effect. Super high signal as well, it catches far more than any human can and never gets tired.
Likewise, we can scale this to any arbitrary review task, so I'm looking at adding benchmarking and performance tuning suggestions for menial profiling tasks like "what data structure should I use".
Heard a presentation from one of their AI engineers where they had a few slides about using multi-agent systems with different focuses looking through the code before a single human is pinged to look at the pull request.
Unfortunately I didn't graduate from Waterloo, nor did I have referrals last year, so Google autorejects me even from forward deployed engineer roles without giving me an OA.
Instead I get to maintain this myself for several hundred developers as a junior and get all my guidance from HN.
That sounds like a completely made up bullshit number that a junior engineer would put on a resume. There’s absolutely no way you have enough data to state that with anything approaching the confidence you just did.
It is based on $125/hr and it assumes review time is inversely proportional to number of review hours.
Then time to merge can be modelled as
T_total = T_fixed + T_review
where fixed time is stuff like CI. For the sake of this T_fixed = T_review i.e. 50% of time is spent in review. (If 100% of time is spent in review it's more like $800k so I'm being optimistic)
T_review is proportional to 1/(review hours).
We know the T_total has been reduced by 23.4% in an A/B test, roughly, due to this AI tool, so I calculate how much equivalent human reviewer time would've been needed to get the same result under the above assumptions. This creates the following system of equations:
T_total_new = T_fixed + T_review_new
T_total_new = T_total * (1 - r)
where r = 23.4%. This simplifies to:
T_review_new = T_review - r * T_total
since T_review / T_review_new = capacity_new / capacity_old (because inverse proportionality assumption). Call this capacity ratio `d`. Then d simplifies to:
d = 1/(1 - r/(T_review/T_total))
T_review/T_total is % of total review time spent on PR, so we call that `a` and get the expression:
d = 1 / (1 - r/a)
Then at 50% of total time spent on review a=0.5 and r = 0.234 as stated. Then capacity ratio is calculated at:
d ≈ 1.8797
Likewise, we have like 40 reviewers devoting 20% of a 40 hr workweek, giving us 320 hours. Multiply by (d - 1) and get roughly 281.5 hours of additional equivalent time, or about $35,188/week, which over 52 weeks is a little over $1.8 million/year.
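The arithmetic above can be sanity-checked in a few lines, using the figures stated in the thread (the 23.4% reduction, a = 0.5, and $125/hr are all taken from the comments, not independently verified):

```python
# Sanity check of the capacity-ratio derivation above.
r = 0.234    # observed reduction in total time-to-merge (A/B test)
a = 0.5      # assumed fraction of total time spent on review
rate = 125   # assumed reviewer cost in $/hr

# d = capacity_new / capacity_old = 1 / (1 - r/a)
d = 1 / (1 - r / a)
print(round(d, 4))             # 1.8797

hours = 40 * 0.20 * 40         # 40 reviewers * 20% of a 40 hr week = 320
extra_hours = hours * (d - 1)  # additional equivalent review hours/week
print(round(extra_hours, 1))   # 281.5

weekly = extra_hours * rate
print(round(weekly * 52))      # ~1.83M per year
```

Note how sensitive the result is to `a`: at a = 0.3 the denominator 1 - r/a goes to 0.22 and d balloons, which is why the 50% review-time assumption carries most of the weight here.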
Ofc I think we cost more than $125 once you consider health insurance and all that, likewise our reviewers are probably not doing 20% of their time consistently, but all of those would make my dollar value higher.
The most optimistic assumption I made is 50% of time spent on review.
But even if that is correct you need a much longer time frame to tell if reviews using this new tool are equivalent as a quality control measure.
And you have so many assumptions built into this that your number is worthless. You aren't controlling for all the variables you need to control for. How do you know that workers spend 8 hours a week on reviews vs spending 2 hours and slacking off the other 6? How do you know that the change of process created by using this tool doesn't just cause the reviewers to work harder, but they'll stop doing that once the novelty wears off? What if reviewers start relying on this tool to catch a certain class of errors for which it has low sensitivity?
It's also a moot point if they don't actually end up saving the money you say they will. It could be that all the savings are eaten up because the reviewers just use the extra time to dick around on Hacker News. It could be that people aren't able to make productive use of their time saved. Maybe they were already maxing out their time doing other useful activities.
All of this screams junior engineer took very limited results and extrapolated to say “saved the company millions” without nearly enough supporting evidence. Run your tool for 6 months, take an actual business outcome like time to merge PRs, measure that, and put that on your resume.
It’s incredibly common for a junior engineer to create some new tooling, and come up with some numbers to justify how this new tooling saves the company millions in labor. I have never once seen these “savings” actually pan out.
> All of this screams junior engineer took very limited results and extrapolated to say “saved the company millions” without nearly enough supporting evidence.
That's what the only person in my major who got a job at FAANG in California did, which is why I borrowed the strategy since it seems to work.
> I can almost guarantee you that an A/B test design wasn’t rigorous enough for you to be that confident in your numbers.
Shoot me an email about methodology! It's my username at gmail. I'd be happy to get more mentorship about more rigorous strategies and I can respond to concerns in less of a PR voice.
Overall effort was a few days of agentic vibe-coding over a period of about 3 weeks. It would have been faster, but the parallel agents burn through tokens extremely quickly and hit Max plan limits in under an hour.
1. https://github.com/ecliptik/fluxland
The jury is still very far out on how agentic development affects mid/long term speed and quality. Those feedback cycles are measured in years, not weeks. If we bother to measure at all.
People in our field generally don't do what they know works, because by and large, nobody really knows, beyond personal experiences, and I guess a critical mass doesn't even really care. We do what we believe works. Programming is a pop culture.
You can also, after those sessions where they get stuff wrong, ask for an analysis of what went wrong that session and have it produce a ranked list. I just started doing that and wow, it comes up with pretty solid lists. I'm not sure if it's sustainable to simply consolidate and prune it, but maybe it is?
Most tests people write have to be changed if you refactor.
Now these things are being made. I can justify spending 5-10 minutes on something without being upset if AI can't solve the problem yet.
And if not, I'll try again in 6 months. These aren't time sensitive problems to begin with or they wouldn't be rotting on the back burner in the first place.
Where does one get started?
How do you manage multiple agents working in parallel on a single project? Surely not the same working directory tree, right? Copies? Different branches / PRs?
You can't use your Claude Code login and have to pay API prices, right? How expensive does it get?
Set an env var and ask it to create a team. If you're running in tmux it will take over the session and spawn multiple agents, all coordinated through a "manager" agent. I recommend running it sandboxed with --dangerously-skip-permissions, otherwise it's endless approvals.
Churns through tokens extremely quickly, so be mindful of your plan/budget.
1. https://code.claude.com/docs/en/agent-teams
Obv, work on things that don't affect each other, otherwise you'll be asking them to look across PRs and that's messy.
a) learning and adapting is at first more effort, not less
b) learning with experiments is faster
c) experiencing the acceleration first hand is demoralising
d) distribution/marketing is on an accelerated declining-efficiency trajectory (if you want to keep it human-generated)
e) maintenance effort is not decelerating as fast as creation effort
Yet I believe your statement is wrong in the first place. A lot of new code is already created with AI assistance, and part of the acceleration in AI itself can be attributed to increased use of AI in software engineering (from research to planning to execution).
The long tail of deployable software always strikes at some point, and monetization is not the first thing I think of when I look at my personal backlog.
I also am a tmux+claude enjoyer, highly recommended.
Trying workmux with claude. Really cool combo
I actually had a manager once who would say Done-Done-Done. He’s clearly seen some shit too.
Obviously no users will see a benefit directly but I reckon it'll speed up delivery of code a lot.
Most software is mundane run of the mill CRUD feature set. Just yesterday I rolled out 5 new web pages and revamped a landing page in under an hour that would have easily taken 3-4 days of back and forth.
There is a lot of similar coding happening.
This is the space AI coding truly shines. Repetitive work, all the wiring and routing around adding links, SEO elements and what not.
Either way, you can try to incorporate AI coding into your coding flow and see where it takes you.
There is a component to this that keeps a lot of the software being built with these tools underground: there are a lot of very vocal people who are quick with downvotes and criticisms of things built with AI tooling, criticism that wouldn't have been applied to the same result (or an even poorer one) if it had been generated by a human.
This is largely why I haven't released one of the tools I've built for internal use: an easy status dashboard for operations people.
Things I've done with agent teams:
- Added a first-class ZFS backend to ganeti
- Rebuilt our "icebreaker" app that we use internally (largely to add special effects and make it more fun)
- Built a "filesystem swiss army knife" for Ansible
- Converted a Lambda function that does image manipulation and watermarking from Pillow to pyvips, and also had it build versions in go, rust, and zig for comparison's sake
- Built tooling for regenerating our cache of watermarked images using new branding
- Had it connect to a pair of MS SQL test servers and identify why logshipping was broken between them
- Built an Ansible playbook to deploy a new AWS account
- Made a simple video poker web app (a demo for the local users group; someone there was asking how to get started with AI)
- Had it brainstorm and build 3 versions of a crossword-themed daily puzzle (just to see what it'd come up with; my wife and I are enjoying TiledWords and I wanted to see what AI would come up with)
Those are the most memorable things I've used the agent teams to build in the last 3 weeks. Many of those things are internal tools or just toys, as another reply said. Some of those are publicly released or in progress for release. Most of these are in addition to my normal work, rather than as a part of it.
For 3-4 years I've been toying with this in various forms. The idea is an "fsbuilder" module that makes a task that logically groups filesystem setup (as opposed to grouping by operation as the ansible.builtin modules do).
You set up the defaults (mode, owner/group, etc.) in the main part of the task, then in your "loop" you list the fs components and any necessary overrides for the defaults. The simplest case defaults to a template with the source "myapp.conf.j2", but you can also do more complex things. I am using this extensively in our infrastructure and run ~20 runs a day, so it's fairly well tested.

More information at: https://galaxy.ansible.com/ui/repo/published/linsomniac/fsbu...
If you have a really big test suite to build against, you can do more, but we're still a ways off from dark software factories being viable. I guessed ~3 years back in mid 2025 and people thought I was crazy at the time, but I think it's a safe time frame.
They built the popular compound-engineering plugin and have shipped a set of production grade consumer apps. They offer a monthly subscription and keep adding to that subscription by shipping more tools.
https://git.ceux.org/cashflow.git/
This seems like it'd be great for solo projects but starts to fall apart for a team with a lot more PRs and distributed state. Heck, I run almost everything in a worktree, so even there the state is distributed. Maybe moving some of the state/plans/etc to Linear et al solves that though.
[1] https://cas.dev
https://open.substack.com/pub/sluongng/p/stages-of-coding-ag...
I think we need much different tooling to go beyond a 1 human : 10 agents ratio, and much, much different tooling to achieve a higher ratio than that.
Imagine a superhuman agent who does not need to run in endless loops. It could generate 100k line code-base in a few minutes or solve smaller features in seconds.
In a way, the inefficiency is what leads people to parallelism. There is only room for it because the agents are slow, perhaps the more inefficient and slower the individual agents are, the more parallel we can be.
All of this is not a direct signal to a productivity boost. I think at higher volumes, you will need to start to account for the "yield" rate of the token volumes above: what are the volumes of tokens that get to the final production deployment? At which stage is it a constraint on the yield? Is it the models, or is it the harness, or something else (i.e. Code Review, CI/CD, Security Scans etc...)? And then it becomes an optimization problem to reduce the Cost of Goods Sold while improving/maintaining Revenues. The "productivity" will then be dissolved into multiple separate but more tangible metrics.
So we are just now getting agents which can reliably loop themselves for medium-size tasks. This generation opens a new door towards agent-managing-agents chain-of-thought data. I think we will only get multi-agents with high reliability sometime between mid and end of 2026, assuming no major geopolitical disruption.
At the end of the day, I think that it all comes down to building what works for you. But at this point there is no doubt AI will play an important role to speed up workflows and augment one’s capacity.
I agree there is no one size fits all (yet). I have looked into a lot of orchestrators and none so far have fit my needs. I prefer my customized simple setup.
1. We discuss every question with opus, and we ask for a second opinion from codex (just a skill that teaches claude how to call codex) where even I'm not sure what the right approach is.
2. When the context window reaches ~120k tokens, I ask opus to update the relevant spec files.
3. Repeat until all 3 of us - me, opus and codex - are happy or are starting to discuss nitpicks and YAGNIs. Whichever comes earlier.
Then it's fully autonomous until all agents are happy.
Which is why I'm exploring optimization strategies. Based on an analysis of where most of the tokens are spent in my workflow, roughly 40% are thinking tokens ("hmm, not sure, maybe..."), and 30% are code files.
So two approaches:
1. A cheap supervisor agent that detects when claude is unsure about something (which means a spec gap) and alerts me so that I can step in.
2. An "oracle" agent that keeps relevant parts of the codebase in context and can answer questions from builder agents.
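A minimal sketch of the supervisor idea, assuming it just scans streamed agent output for hedging phrases (the phrase list and threshold below are made up for illustration; a real setup would tune them against actual transcripts):

```python
import re

# Hypothetical hedging phrases that suggest the builder agent is
# unsure, i.e. a likely spec gap. Tune against your own transcripts.
UNCERTAINTY_PATTERNS = [
    r"\bnot sure\b",
    r"\bmaybe\b",
    r"\bI(?:'| a)m assuming\b",
    r"\bunclear\b",
    r"\bcould be either\b",
]

def uncertainty_score(transcript: str) -> int:
    """Count hedging phrases in a chunk of agent output."""
    return sum(
        len(re.findall(p, transcript, flags=re.IGNORECASE))
        for p in UNCERTAINTY_PATTERNS
    )

def should_alert(transcript: str, threshold: int = 3) -> bool:
    """Ping the human when the agent hedges too often in one chunk."""
    return uncertainty_score(transcript) >= threshold

chunk = "Hmm, not sure, maybe the auth flow expects a refresh token? Unclear."
print(uncertainty_score(chunk))  # 3 hedges in this chunk
print(should_alert(chunk))       # True
```

Crude, but the appeal is that it runs on a cheap model (or no model at all) while the expensive builder agents keep working.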
And also delegating some work to cheaper models like GLM where top performance isn't necessary.
You'll notice that as soon as you reach a setup you like that actually works, $200 subscription quotas will become a limiting factor.
I also kinda expect that one of the saner parts of agentic development is the skills system, that skills can be completely deterministic, and that after the Trough of Disillusionment people will be using skills a lot more and AI a lot less.
So it's spec (human in the loop) > plan > build. Then it cycles autonomously in plan > build until spec goals are achieved. This orchestration is all managed by a simple shell script.
But even with the implementation plan file, a new agent has to orient itself, load files it may later decide were irrelevant, the plan may have not been completely correct, there could have been gaps, initial assumptions may not hold, etc. It then starts eating tokens.
And it feels like this can be optimized further.
And yes on deterministic tooling as well.
Regardless, the one thing that I do find useful is a markdown task list because this survives context damage. This is a harness workaround that I fully anticipate will be dealt with in Claude Code itself.