the agentic shift is where the legal and insurance worlds are really going to struggle. we know how to model human error, but modeling an autonomous loop that makes a chain of small decisions leading to a systemic failure is a whole different beast. the audit trail requirements for these factories are going to be a regulatory nightmare.
Alex_L_Wood 15 minutes ago [-]
>If you haven't spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement
…What am I even reading? Am I crazy to think this is a crazy thing to say, or is it actually crazy?
gassi 11 minutes ago [-]
My favorite conspiracy theory is that these projects/blog posts are secretly backed by big-AI tech companies, to offset their staggering (unsustainable) losses by convincing executives to shovel pools of money into AI tools.
delusional 12 minutes ago [-]
It's crazy if you're an engineer. It's pretty common for middle managers to quantify "progress" in terms of "spend".
My boss's boss's boss likes to claim that we're successfully moving to the cloud because the cost is increasing year over year.
dexwiz 20 seconds ago [-]
Growth will be proportional to spend. You can cut waste later and celebrate efficiency. So when growing there isn't much incentive to do it efficiently. You are just robbing yourself of a potential future victory. Also it's legitimately difficult to maximize growth while prioritizing efficiency. It's like how a body builder cycles between bulking and cutting. For mid to long term outlooks it's probably the best strategy.
FuckButtons 6 minutes ago [-]
Appropriate username.
noosphr 3 hours ago [-]
I was looking for some code, or a product they made, or anything really on their site.
Building Attractor
Supply the following prompt to a modern coding agent
(Claude Code, Codex, OpenCode, Amp, Cursor, etc):
codeagent> Implement Attractor as described by
https://factory.strongdm.ai/
Canadian girlfriend coding is now a business model.
I've looked at their code for a few minutes in a few files, and while I don't know what they're trying to do well enough to say for sure anything is definitely a bug, I've already spotted several things that seem likely to be, and several others that I'd class as anti-patterns in rust. Don't get me wrong, as an experiment this is really cool, but I do not think they've succeeded in getting the "dark factory" concept to work where every other prominent attempt has fallen short.
simonw 2 hours ago [-]
Out of interest, what anti-patterns did you see?
(I'm continuing to try to learn Rust!)
lunar_mycroft 51 minutes ago [-]
To pick a few (from the server crate, because that's where I looked):
- The StoreError type is stringly typed and generally badly thought out. Depending on what they actually want to do, they should either add more variants to StoreError for the different failure cases, replace the strings with sub-types (probably enums) to do the same, or write a type-erased error similar to (or wrapping) the ones provided by anyhow, eyre, etc, but with a status code attached. They definitely shouldn't be checking for substrings in their own error type for control flow (a rough sketch of what I mean is at the end of this comment).
- So many calls to String::clone [0]. Several of the ones I saw were actually only necessary because the function took a parameter by reference even though it could have (and I would argue should have) taken it by value (If I had to guess, I'd say the agent first tried to do it without the clone, got an error, and implemented a local fix without considering the broader context).
- A lot of errors are just ignored with Result::unwrap_or_default or the like. Sometimes that's the right choice, but from what I can see they're allowing legitimate errors to pass silently. They also treat the values they get in the error case differently, rather than e.g. storing a Result or Option.
- Their HTTP handler has an 800 line long closure which they immediately call, apparently as a substitute for the still unstable try_blocks feature. I would strongly recommend moving that into its own full function instead.
- Several if statements which should have been matches.
- Calls to Result::unwrap and Option::unwrap. IMO in production code you should always at minimum use expect instead, forcing you to explain what went wrong/why the Err/None case is impossible.
It wouldn't catch all/most of these (and from what I've seen might even induce some if agents continue to pursue the most local fix rather than removing the underlying cause), but I would strongly recommend turning on most of clippy's lints if you want to learn rust.
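A rough sketch of the error-type point (all names here are made up, not their actual code): give the error real variants, attach the status code to the type, so control flow matches on variants instead of searching message strings, and prefer expect over unwrap so the failure mode is documented.

    use std::fmt;

    #[derive(Debug)]
    enum StoreError {
        NotFound { key: String },
        Conflict { key: String },
        Backend(Box<dyn std::error::Error + Send + Sync>),
    }

    impl StoreError {
        // Control flow keys off the variant, not off substrings of a message.
        fn status_code(&self) -> u16 {
            match self {
                StoreError::NotFound { .. } => 404,
                StoreError::Conflict { .. } => 409,
                StoreError::Backend(_) => 500,
            }
        }
    }

    impl fmt::Display for StoreError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            match self {
                StoreError::NotFound { key } => write!(f, "no value stored for {key}"),
                StoreError::Conflict { key } => write!(f, "conflicting write for {key}"),
                StoreError::Backend(e) => write!(f, "backend failure: {e}"),
            }
        }
    }

    impl std::error::Error for StoreError {}

    fn load_config() -> String {
        // expect() over unwrap(): say why the Err/None case "can't" happen, or what broke if it did.
        std::fs::read_to_string("config.toml")
            .expect("config.toml must exist next to the binary")
    }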
For those of us working on building factories, this is pretty obvious, because you immediately need shared context across agents / sessions and an improved ID + permissions system to keep track of who is doing what.
yomismoaqui 3 hours ago [-]
I don't know if that is crazy or a glimpse of the future (could be both).
PS: TIL about "Canadian girlfriend", thanks!
ares623 3 hours ago [-]
I was about to say the same thing! Yet another blog post with heaps of navel gazing and zero to actually show for it.
The worst part is they got simonw to (perhaps unwittingly, or through social engineering) vouch and stealth market for them.
And $1000/day/engineer in token costs at current market rates? It's a bold strategy, Cotton.
But we all know what they're going for here. They want to make themselves look amazing to convince the boards of the Great Houses to acquire them. Because why else would investors invest in them and not in the Great Houses directly.
simonw 2 hours ago [-]
The "social engineering" is that I was invited to a demo back in October and thought it was really interesting.
(Two people whose opinions I respect said "yeah you really should accept that invitation" otherwise I probably wouldn't have gone.)
I've been looking forward to being able to write more details about what they're doing ever since.
ucirello 2 hours ago [-]
Justin never invites me in when he brings the cool folks in! Dang it...
ares623 2 hours ago [-]
I will look forward to that blog post then, hopefully it has more details than this one.
EDIT nvm just saw your other comment.
navanchauhan 50 minutes ago [-]
I think this comment is slightly unfair :(
We’ve been working on this since July, and we shared the techniques and principles that have been working for us because we thought others might find them useful. We’ve also open-sourced the nlspec so people can build their own versions of the software factory.
We’re not selling a product or service here. This also isn’t about positioning for an acquisition: we’ve already been in a definitive agreement to be acquired since last month.
It’s completely fair to have opinions and to not like what we’re putting out, but your comment reads as snarky without adding anything to the conversation.
ebhn 3 hours ago [-]
That's hilarious
CuriouslyC 4 hours ago [-]
Until we solve the validation problem, none of this stuff is going to be more than flexes. We can automate code review, set up analytic guardrails, etc, so that looking at the code isn't important, and people have been doing that for >6 months now. You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.
There are higher and lower leverage ways to do that, for instance reviewing tests and QA'ing software via use vs reading original code, but you can't get away from doing it entirely.
kaicianflone 3 hours ago [-]
I agree with this almost completely. The hard part isn’t generation anymore, it’s validation of intent vs outcome. Especially once decisions are high-stakes or irreversible, think pkg updates or large scale tx
What I’m working on (open source) is less about replacing human validation and more about scaling it: using multiple independent agents with explicit incentives and disagreement surfaced, instead of trusting a single model or a single reviewer.
Humans are still the final authority, but consensus, adversarial review, and traceable decision paths let you reserve human attention for the edge cases that actually matter, rather than reading code or outputs linearly.
Until we treat validation as a first-class system problem (not a vibe check on one model’s answer), most of this will stay in “cool demo” territory.
sonofhans 3 hours ago [-]
“Anymore?” After 40 years in software I’ll say that validation of intent vs. outcome has always been a hard problem. There are and have been no shortcuts other than determined human effort.
kaicianflone 3 hours ago [-]
I don’t disagree. After decades, it’s still hard which is exactly why I think treating validation as a system problem matters.
We’ve spent years systematizing generation, testing, and deployment. Validation largely hasn’t changed, even as the surface area has exploded. My interest is in making that human effort composable and inspectable, not pretending it can be eliminated.
cronin101 4 hours ago [-]
This obviously depends on what you are trying to achieve but it’s worth mentioning that there are languages designed for formal proofs and static analysis against a spec, and I have suspicions we are currently underutilizing them (because historically they weren’t very fun to write, but if everything is just tokens then who cares).
And “define the spec concretely“ (and how to exploit emerging behaviors) becomes the new definition of what programming is.
svilen_dobrev 3 hours ago [-]
> “define the spec concretely“
(and unambiguously. and completely. For various depths of those)
This always has been the crux of programming. Just has been drowned in closer-to-the-machine more-deterministic verbosities, be it assembly, C, prolog, js, python, html, what-have-you
There have been never-ending attempts to reduce that to a more away-from-machine representation. Low-code/no-code (anyone remember Last-one for Apple ][ ?), interpreting-and/or-generating-off DSLs of various levels of abstraction, further to esperanto-like artificial reduced-ambiguity languages... some even english-like..
For some domains, above worked/works - and the (business)-analysts became new programmers. Some companies have such internal languages. For most others, not really.
And not that long ago, the SW-Engineer job was called Analyst-programmer.
But still, the frontier is there to cross..
varispeed 3 hours ago [-]
AI also quickly goes off the rails, even the Opus 2.6 I am testing today. The proposed code is very much rubbish, but it passes the tests. It wouldn't pass skilled human review. Worst thing is that if you let it, it will just grow tech debt on top of tech debt.
feastingonslop 5 minutes ago [-]
The code itself does not matter. If the tests pass, and the tests are good, then who cares? AI will be maintaining the code.
simianwords 3 hours ago [-]
did you read the article?
>StrongDM’s answer was inspired by Scenario testing (Cem Kaner, 2003).
CuriouslyC 3 hours ago [-]
Tests are only rigorous if the correct intent is encoded in them. Perfectly working software can be wrong if the intent was inferred incorrectly. I leverage BDD heavily, and there are a lot of little details it's possible to misinterpret going from spec -> code. If the spec was sufficient to fully specify the program, it would be the program, so there's lots of room for error in the transformation.
simianwords 3 hours ago [-]
Then I disagree with you
> You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.
You don't need a human who knows the system to validate it if you trust the LLM to do the scenario testing correctly. And from my experience, it is very trustable in these aspects.
Can you detail a scenario by which an LLM can get the scenario wrong?
politelemon 3 hours ago [-]
I do not trust the LLM to do it correctly. We do not have the same experience with them, and should not assume everyone does. To me, your question makes no sense to ask.
simianwords 3 hours ago [-]
We should be able to measure this. I think verifying things is something an llm can do better than a human.
You and I disagree on this specific point.
Edit: I find your comment a bit distasteful. If you can provide a scenario where it can get it incorrect, that's a good discussion point. I don't see many places where LLMs can't verify as well as humans. If I developed new business logic like "users from country X should not be able to use this feature", an LLM can very easily verify this by generating its own sample API call and checking the response.
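That kind of generated check is tiny. A sketch (hypothetical endpoint and header names, assuming reqwest's blocking client): call the feature as a user from the blocked country and assert the API refuses it.

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let client = reqwest::blocking::Client::new();
        let resp = client
            .post("http://localhost:8080/api/feature")
            .header("X-User-Country", "XX") // the restricted country in this scenario
            .body("{}")
            .send()?;
        // Business rule under test: users from country XX must not be able to use this feature.
        assert_eq!(resp.status(), reqwest::StatusCode::FORBIDDEN);
        println!("scenario passed: blocked-country request was rejected");
        Ok(())
    }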
CuriouslyC 3 hours ago [-]
The whole point is that you can't 100% trust the LLM to infer your intent with accuracy from lossy natural language. Having it write tests doesn't change this, it's only asserting that its view of what you want is internally consistent, it is still just as likely to be an incorrect interpretation of your intent.
enraged_camel 2 hours ago [-]
>> The whole point is that you can't 100% trust the LLM to infer your intent with accuracy from lossy natural language.
You can't 100% trust a human either.
But, as with self-driving, the LLM simply needs to be better. It does not need to be perfect.
simianwords 60 minutes ago [-]
Good analogy
senordevnyc 3 hours ago [-]
The whole point is that you can't 100% trust the LLM to infer your intent with accuracy from lossy natural language.
Then it seems like the only workable solution from your perspective is a solo member team working on a product they came up with. Because as soon as there's more than one person on something, they have to use "lossy natural language" to communicate it between themselves.
CuriouslyC 3 hours ago [-]
Coworkers are absolutely an ongoing point of friction everywhere :)
On the plus side, IMO nonverbal cues make it way easier to tell when a human doesn't understand things than an agent.
codingdave 4 hours ago [-]
> If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement
At that point, outside of FAANG and their salaries, you are spending more on AI than you are on your humans. And they consider that level of spend to be a metric in and of itself. I'm kinda shocked the rest of the article just glossed over that one. It seems to be a breakdown of the entire vision of AI-driven coding. I mean, sure, the vendors would love it if everyone's salary budget just got shifted over to their revenue, but such a world is absolutely not my goal.
simonw 4 hours ago [-]
Yeah I'm going to update my piece to talk more about that.
This is an interesting point but if I may offer a different perspective:
Assuming 20 working days a month: that's 20k x 12 == 240k a year. So about a fresh grad's TC at FANG.
Now I've worked with many junior to mid-junior level SDEs and sadly 80% do not do a better job than Claude. (I've also worked with staff level SDEs who write worse code than AI, but they usually offset that with domain knowledge and TL responsibilities)
I do see AI transforming software engineering into even more of a pyramid with very few humans on top.
mejutoco 3 hours ago [-]
Original claim was:
> At that point, outside of FAANG and their salaries, you are spending more on AI than you are on your humans
You say
> Assuming 20 working days a month: that's 20k x 12 == 240k a year. So about a fresh grad's TC at FANG.
So you both are in agreement on that part at least.
bobbiechen 4 hours ago [-]
Important too, a fully loaded salary costs the company far more than the actual salary that the employee receives. That would tip this balancing point towards 120k salaries, which is well into the realm of non-FAANG
dewey 4 hours ago [-]
It would depend on the speed of execution: if you can do the same amount of work in 5 days by spending $5k, vs. spending a month and $5k on a human, the math makes more sense.
verdverm 4 hours ago [-]
You won't know which path has larger long term costs. For example, what if the AI version costs 10x to run?
kaffekaka 4 hours ago [-]
If the output is (dis)proportionally larger, the cost trade off might be the right thing to do.
And it might be the tokens will become cheaper.
obirunda 3 hours ago [-]
Tokens will become significantly more expensive in the short term, actually. This is not stemming from some sort of anti-AI sentiment. You have two ramps that are going to drive this: 1. Increased demand, linear growth at least, but likely this is already exponential. 2. Scaling laws demand, well, more scale.
Future better models will demand both higher compute use AND higher energy. We should not underestimate the slowness of energy production growth, and also the supplies required for simply hooking things up. Some labs are commissioning their own power plants on site, but this is not a true accelerator past power grid growth limits. You're using the same supply chain to build your own power plant.
If inference cost is not dramatically reduced and models don't start meaningfully helping with innovations that make energy production faster and inference/training demand less power, the only way to control demand is to raise prices. Current inference costs do not cover training costs. They can probably continue to do that on funding alone, but once the demand curve hits the power production limits, only one thing can slow demand, and that's raising the cost of use.
philipp-gayret 4 hours ago [-]
$1,000 is maybe $5 per workday. I measure my own usage and am on the way to $6,000 for a full year. I'm still at the stage where I like to look at the code I produce, but I do believe we'll head to a state of software development where one day we won't need to.
gipp 4 hours ago [-]
Maybe read that quote again. The figure is 1000 per day
verdverm 4 hours ago [-]
The quote is if you haven't spent $1000 per dev today
which sounds more like if you haven't reached this point you don't have enough experience yet, keep going
At least that's how I read the quote
delecti 3 hours ago [-]
Scroll further down (specifically to the section titled "Wait, $1,000/day per engineer?"). The quote in the quoted article (so from the original source in factory.strongdm.ai) could potentially be read either way, but Simon Willison (the direct link) absolutely is interpreting it as $1000/dev/day. I also think $1000/dev/day is the intended meaning in the strongdm article.
amarant 3 hours ago [-]
"If you haven't spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement"
Apart from being an absolutely ridiculous metric, this is a bad approach, at least with current generation models. In my experience, the less you inspect what the model does, the more spaghetti-like the code will be. And the flying spaghetti monster eats tokens faster than you can blink! Or put more clearly: implementing a feature will cost you a lot more tokens in a messy code base than it does in a clean one. It's not (yet) enough to just tell the agent to refactor and make it clean, you have to give it hints on how to organise the code.
I'd go so far as to say that if you're burning a thousand dollars a day per engineer, you're getting very little bang for your tokens.
It's short-term vs long-term optimization. Short-term optimization is making the system effective right now. Long-term optimization is exploring ways to improve the system as a whole.
This one is worth paying attention to. They're the most ambitious team I've seen exploring the limits of what you can do with this stuff. It's eye-opening.
enderforth 4 hours ago [-]
This right here is where I feel most concerned
> If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement
Seems to me like if this is true I'm screwed no matter if I want to "embrace" the "AI revolution" or not. No way my manager's going to approve me to blow $1000 a day on tokens, they budgeted $40,000 for our team to explore AI for the entire year.
Let alone from a personal perspective I'm screwed because I don't have $1000 a month in the budget to blow on tokens because of pesky things that also demand financial resources like a mortgage and food.
At this point it seems like damned if I do, damned if I don't. Feels bad man.
simonw 4 hours ago [-]
Yeah, that's one part of this that didn't sit right with me.
I don't think you need to spend anything like that amount of money to get the majority of the value they're describing here.
I wonder if this is just a byproduct of factories being very early and very inefficient. Yegge and Huntley both acknowledge that their experiments in autonomous factories are extremely expensive and wasteful!
I would expect cost to come down over time, using approaches pioneered in the field of manufacturing.
noosphr 4 hours ago [-]
This is the part that feels right to me because agents are idiots.
I built a tool that writes (non shit) reports from unstructured data to be used internally by analysts at a trading firm.
It cost between $500 to $5000 per day per seat to run.
It could have cost a lot more but latency matters in market reports in a way it doesn't for software. I imagine they are burning $1000 per day per seat because they can't afford more.
threecheese 3 hours ago [-]
They are idiots, but getting better. Ex: wrote an agent skill to do some read only stuff on a container filesystem. Stupid I know, it’s like a maintainer script that can make recommendations, whatever.
Another skill called skill-improver, which tries to reduce skill token usage by finding deterministic patterns in another skill that can be scripted, and writes and packages the script.
Putting them together, the container-maintenance thingy improves itself every iteration, validated with automatic testing. It works perfectly about 3/4 of the time, another half of the time it kinda works, and fails spectacularly the rest.
It’s only going to get better, and this fit within my Max plan usage while coding other stuff.
noosphr 3 hours ago [-]
LLMs are idiots and they will never get better because they have quadratic attention and a limited context window.
If the tokens that need to attend to each other are on opposite ends of the code base the only way to do that is by reading in the whole code base and hoping for the best.
If you're very lucky you can chunk the code base in such a way that the chunks pairwise fit in your context window and you can extract the relevant tokens hierarchically.
If you're not? Well, get reading, monkey.
Agents, md files, etc. are bandaids to hide this fact. They work great until they don't.
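The hierarchical workaround looks something like this (a std-only sketch with made-up sizes; summarize() is a stand-in for an LLM call): chunk the code base so each piece fits in a context window, summarize each chunk, then recurse over the summaries until the whole thing fits.

    fn summarize(text: &str) -> String {
        // Stand-in for an LLM call; truncation keeps the sketch self-contained and runnable.
        text.chars().take(200).collect()
    }

    // Naive byte chunking; a real version would split on file/module boundaries.
    fn chunk(text: &str, max_len: usize) -> Vec<String> {
        text.as_bytes()
            .chunks(max_len)
            .map(|c| String::from_utf8_lossy(c).into_owned())
            .collect()
    }

    fn hierarchical_summary(text: &str, window: usize) -> String {
        let mut current = text.to_owned();
        while current.len() > window {
            let prev_len = current.len();
            let summaries: Vec<String> = chunk(&current, window)
                .iter()
                .map(|c| summarize(c))
                .collect();
            current = summaries.join("\n");
            // Bail out if "summarizing" stops shrinking the text, rather than loop forever.
            if current.len() >= prev_len {
                break;
            }
        }
        current
    }

    fn main() {
        let fake_code_base = "fn main() {}\n".repeat(10_000);
        println!("{}", hierarchical_summary(&fake_code_base, 4_000).len());
    }

It works great until the tokens that matter are exactly the ones a summary threw away, which is the point above.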
DrewADesign 2 hours ago [-]
> No way my manager's going to approve me to blow $1000 a day on tokens, they budgeted $40,000 for our team to explore AI for the entire year.
To be fair, I’ll bet many of those embracing concerning advice like that have never worked for the same company for a full year.
reilly3000 4 hours ago [-]
My friend works at Shopify and they are 100% all in on AI coding. They let devs spend as much as they want on whatever tool they want. If someone ends up spending a lot of money, they ask them what is going well and please share with others. If you’re not spending they have a different talk with you.
As for me, we get Cursor seats at work, and at home I have a GPU, a cheap Chinese coding plan, and a dream.
zingar 2 hours ago [-]
What results are you getting at home?
r0b05 3 hours ago [-]
> I have a GPU, a cheap Chinese coding plan, and a dream
Right in the feels
dude250711 3 hours ago [-]
> If someone ends up spending a lot of money, they ask them what is going well and please share with others. If you’re not spending they have a different talk with you.
Make a "systemctl start tokenspender.service" and share it with the team?
sergiotapia 3 hours ago [-]
I get $200 a month, I do wish I could get $1000 and stop worrying about trying the latest AI tools.
buster 4 hours ago [-]
Maybe the point is that the one engineer replaces 10 engineers by using the dark factory, which by definition doesn't need humans.
FeteCommuniste 4 hours ago [-]
The great hope of CEOs everywhere.
mgkimsal 3 hours ago [-]
I read that as combined, up to this point in time. You have 20 engineers? If you haven't spent at least $20k up to this point, you've not explored or experienced enough of the ins and outs to know how best to optimize the use of these tools.
I didn't read that as you need to be spending $1k/day per engineer. That is an insane number.
EDIT: re-reading... it's ambiguous to me. But perhaps they mean per day, every day. This will only hasten the elimination of human developers, which I presume is the point.
christoph 3 hours ago [-]
Same. Feels like it goes against the entire “hacker” ethos that brought me here in the first place. That sentence made me actually feel physically sick on initial read as well. Every day now feels like a day where I have exponentially less & less interest in tech. If all of this AI that’s burning the planet is so incredible, where are the real world tangible improvements? I look around right now and everything in tech, software, internet, etc. has never looked so similar to a dumpster fire of trash.
zingar 2 hours ago [-]
The biggest rewards for human developers came from building addictive eyeball-getters for adverts so I don’t see how we can expect a very high bar for the results of their replacement AI factories. Real-world and tangible just seem completely out of the picture.
navanchauhan 4 hours ago [-]
I think corporate incentives vs personal incentives are slightly different here. As a company trying to experiment in this moment, you should be betting on token cost not being the bottleneck. If the tooling proves valuable, $1k/day per engineer is actually pretty cheap.
At home on my personal setup, I haven't even had to move past the cheapest codex/claude code subscription because it fulfills my needs ¯\_(ツ)_/¯. You can also get a lot of mileage out of the higher tiers of these subscriptions before you need to start paying the APIs directly.
rune-dev 4 hours ago [-]
How is 1k/day cheap? Even for a large company?
Takes like this are just baffling to me.
For one engineer that is ~260k a year.
dasil003 3 hours ago [-]
In big companies there is always waste; it's just not possible to be super efficient when you have tens of thousands of people. It's one thing in a steady state, low-competition business where you can refine and optimize processes so everyone knows exactly what their job is, but that is generally not the environment that software companies operate in. They need to be able to innovate and stay competitive, never more so than today.
The thing with AI is that it ranges from net-negative to easily brute forcing tedious things that we never have considered wasting human time on. We can't figure out where the leverage is unless all the subject matter experts in their various organizational niches really check their assumptions and get creative about experimenting and just trying different things that may never have crossed their mind before. Obviously over time best practices will emerge and get socialized, but with the rate that AI has been improving lately, it makes a lot of sense to just give employees carte blanche to explore. Soon enough there will be more scrutiny and optimization, but that doesn't really make sense without a better understanding of what is possible.
zingar 2 hours ago [-]
I assumed that they are saying that you spend $1k per day and that makes the developer as productive as some multiple of the number of people you could hire for that $1k.
libraryofbabel 3 hours ago [-]
I do not really agree with the below, but the logic is probably:
1) Engineering investment at companies generally pays off in multiples of what is spent on engineering time. Say you pay 10 engineers $200k / year each and the features those 10 engineers build grow yearly revenue by $10M. That’s a 4x ROI and clearly a good deal. (Of course, this only applies up to some ceiling; not every company has enough TAM to grow as big as Amazon).
2) Giving engineers near-unlimited access to token usage means they can create even more features, in a way that still produces positive ROI per token. This is the part I disagree with most. It’s complicated. You cannot just ship infinite slop and make money. It glosses over massive complexity in how software is delivered and used.
3) Therefore (so the argument goes) you should not cap tokens and should encourage engineers to use as many as possible.
Like I said, I don’t agree with this argument. But the key thing here is step 1. Engineering time is an investment to grow revenue. If you really could get positive ROI per token in revenue growth, you should buy infinite tokens until you hit the ceiling of your business.
Of course, the real world does not work like this.
rune-dev 3 hours ago [-]
Right, I understand of course that AI usage and token costs are an investment (probably even a very good one!).
But my point is more so that saying 1k a day is cheap is ridiculous. Even for a company that expects an ROI on that investment. There are risks involved and, as you said, diminishing returns on software output.
I find AI bros' view of the economics of AI usage strange. It's reasonable to me to say you think it's a good investment, but to say it's cheap is a whole different thing.
libraryofbabel 3 hours ago [-]
Oh sure. We agree on all you said. I wouldn’t call it cheap either. :)
The best you can say is “high cost but positive ROI investment.” Although I don’t think that’s true beyond a certain point either, certainly not outside special cases like small startups with a lot of funding trying to build a product quickly. You can’t just spew tokens about and expect revenue to increase.
That said, I do reserve some special scorn for companies that penny-pinch on AI tooling. Any CTO or CEO who thinks a $200/month Claude Max subscription (or equivalent) for each developer is too much money to spend really needs to rethink their whole model of software ROI and costs. You're often paying your devs >$100k/yr and you won't pay $2k/yr to make them more productive? I understand there are budget and planning cycle constraints blah blah, but… really?!
riazrizvi 3 hours ago [-]
Until there's something verifiable it's just talk. Talk was cheap. Now talk has become an order of magnitude cheaper since ChatGPT.
benreesman 3 hours ago [-]
It is tempting to be stealthy when you start seeing discontinuous capabilities go from totally random to somewhat predictable. But most of the key stuff is on GitHub.
The moats here are around mechanism design and values (to the extent they differ): the frontier labs are doomed in this world, the commons locked up behind paywalls gets hyper mirrored, value accrues in very different places, and it's not a nice orderly exponent from a sci-fi novel. It's nothing like what the talking heads at Davos say, Anthropic aren't in the top five groups I know in terms of being good at it, it'll get written off as fringe until one day it happens in like a day. So why be secretive?
You get on the ladder by throwing out Python and JSON and learning lean4, you tie property tests to lean theorems via FFI when you have to, you start building out rfl to pretty printers of proven AST properties.
And yeah, the droids run out ahead in little firecracker VMs reading from an effect/coeffect attestation graph and writing back to it. The result is saved, useful results are indexed. Human review is about big picture stuff, human coding is about airtight correctness (and fixing it when it breaks despite your "proof" that had a bug in the axioms).
Programming jobs are impacted but not as much as people think: droids do what David Graeber called bullshit jobs for the most part and then they're savants (not polymath geniuses) at a few things: reverse engineering and infosec they'll just run you over, they're fucking going in CIC.
This is about formal methods just as much as AI.
belter 3 hours ago [-]
Can you make an ethical declaration here, stating whether or not you are being compensated by them?
Their page looks to me like a lot of invented jargon and pure narrative. Every technique is just a renamed existing concept. Digital Twin Universe is mocks, Gene Transfusion is reading reference code, Semport is transpilation. The site has zero benchmarks, zero defect rates, zero cost comparisons, zero production outcomes. The only metric offered is "spend more money".
Anyone working honestly in this space knows 90% of agent projects are failing.
The main page of HN now has three to four posts daily with no substance, just Agentic AI marketing dressed as engineering insight.
With Google, Microsoft, and others spending $600 billion over the next year on AI, and panicking to get a return on that Capex....and with them now paying influencers over $600K [1] to manufacture AI enthusiasm to justify this infrastructure spend, I won't engage with any AI thought leadership that lacks a clear disclosure of financial interests and reproducible claims backed by actual data.
Show me a real production feature built entirely by agents with full traces, defect rates, and honest failure accounting. Or stop inventing vocabulary and posting vibes charts.
> Every technique is just a renamed existing concept. Digital Twin Universe is mocks, Gene Transfusion is reading reference code, Semport is transpilation. The site has zero benchmarks, zero defect rates, zero cost comparisons, zero production outcomes. The only metric offered is "spend more money".
Repeating for emphasis, because this is the VERY obvious question anyone with a shred of curiosity would be asking not just about this submission but about what is CONSTANTLY on the frontpage these days.
There could be a very simple 5 question questionnaire that could eliminate 90+% of AI coding requests before they start:
- Is this a small wrapper around just querying an existing LLM
- Does a brief summary of this searched with "site:github" already return dozens or hundreds of results?
- Is this a classic scam (pump&dump, etc) redone using "AI"
- Is this needless churn between already high level abstractions of technology (dashboard of dashboards, yaml to json, python to java script, automation of automation framework)
Thank you. Your disclosure page is better than all other AI commentators', as most disclose nothing at all. You do disclose an OpenAI payment, Microsoft travel, and the existence of preview relationships.
However, I would argue there are significant gaps:
- You do not name your consulting clients. You admit to doing ad-hoc consulting and training for unnamed companies while writing daily about AI products. Those client names are material information.
- You have non-cash payments that have monetary value. Free API credits, weeks of early preview access, flights, hotels, dinners, and event invitations are all compensation. Do you keep those credits?
- The "I have not accepted payments from LLM vendors" line could still be true while receiving things worth thousands of dollars. Please note I am not saying you did.
- You have a structural conflict. Your favorable coverage will mean preview access, then exclusive content, then traffic, then sponsors, then consulting clients.
- You appeared in an OpenAI promotional video for GPT-5 and were paid for it. This is influencer marketing by any definition.
- Your quotes are used as third-party validation in press coverage of AI product launches. This is a PR function with commercial value to these companies.
The FTC's revised Endorsement Guides explicitly apply to bloggers, not just social media influencers. The FTC defines material connection to include not only cash payments but also free products, early access to a product, event invitations, and appearing in promotional media, all of which would seem to apply here. The FTC's own "Disclosures 101" guide states [2]: "...Disclosures are likely to be missed if they appear only on an ABOUT ME or profile page, at the end of posts or videos, or anywhere that requires a person to click MORE."
I would argue an ecosystem of free access, preview privileges, promotional video appearances, API credits, and undisclosed consulting does constitute a financial relationship that should be more transparently disclosed than "I have not accepted payments from LLM vendors."
japhyr 4 hours ago [-]
> That idea of treating scenarios as holdout sets—used to evaluate the software but not stored where the coding agents can see them—is fascinating. It imitates aggressive testing by an external QA team—an expensive but highly effective way of ensuring quality in traditional software.
This is one of the clearest takes I've seen that starts to get me to the point of possibly being able to trust code that I haven't reviewed.
The whole idea of letting an AI write tests was problematic because they're so focused on "success" that `assert True` becomes appealing. But orchestrating teams of agents that are incentivized to build, and teams of agents that are incentivized to find bugs and problematic tests, is fascinating.
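A minimal sketch of the holdout idea (every path and the scenario format here are hypothetical): keep the scenario files somewhere the coding agents can't read, and have a separate harness run them against the built binary and report a pass fraction rather than a single green/red.

    use std::fs;
    use std::process::Command;

    // Each holdout scenario file holds "input => expected_substring" lines and lives in a
    // directory that is never mounted into the agents' workspace.
    fn main() -> std::io::Result<()> {
        let (mut passed, mut total) = (0usize, 0usize);
        for entry in fs::read_dir("/holdout/scenarios")? {
            let text = fs::read_to_string(entry?.path())?;
            for line in text.lines().filter(|l| l.contains("=>")) {
                let (input, expected) = line.split_once("=>").unwrap();
                total += 1;
                // Run the agent-built binary on the scenario input and check its output.
                let out = Command::new("./target/release/app")
                    .arg(input.trim())
                    .output()?;
                if String::from_utf8_lossy(&out.stdout).contains(expected.trim()) {
                    passed += 1;
                }
            }
        }
        println!(
            "satisfaction: {passed}/{total} = {:.2}",
            passed as f64 / total.max(1) as f64
        );
        Ok(())
    }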
I'm quite curious to see where this goes, and more motivated (and curious) than ever to start setting up my own agents.
Question for people who are already doing this: How much are you spending on tokens?
That line about spending $1,000 on tokens is pretty off-putting. For commercial teams it's an easy calculation. It's also depressing to think about what this means for open source. I sure can't afford to spend $1,000 supporting teams of agents to continue my open source work.
Lwerewolf 4 hours ago [-]
Re: $1k/day on tokens - you can also build a local rig, nothing "fancy". There was a recent thread here re: the utility of local models, even on not-so-fancy hardware. Agents were a big part of it - you just set a task and it's done at some point, while you sleep or you're off to somewhere or working on something else entirely or reading a book or whatever. Turn off notifications to avoid context switches.
I wouldn't be surprised if agents start "bribing" each other.
japhyr 1 hours ago [-]
If they're able to communicate with each other. But I'm pretty sure we could keep that from happening.
I don't take your comment as dismissive, but I think a lot of people are dismissing interesting and possibly effective approaches with short reactions like this.
I'm interested in the approach described in this article because it's specifying where the humans are in all this, it's not about removing humans entirely. I can see a class of problems where any non-determinism is completely unacceptable. But I can also see a large number of problems where a small amount of non-determinism is quite acceptable.
dist-epoch 3 minutes ago [-]
I was not dismissive, I was pointing out that preventing leaks is very hard.
They can communicate through the source code. Also Schelling points.
Something like "approve this PR and I will generate some easy bugs for you to find later"
verdverm 4 hours ago [-]
Do you know what those holdout tests should look like before thoroughly iterating on the problem?
I think people are burning money on tokens letting these things fumble about until they arrive at some working set of files.
I'm staying in the loop more than this, building up rather than tuning out
rileymichael 3 hours ago [-]
> In rule form:
- Code must not be written by humans
- Code must not be reviewed by humans
as a previous strongDM customer, i will never recommend their offering again. for a core security product, this is not the flex they think it is
also mimicking other products' behavior and staying in sync is a fool's task. you certainly won't be able to do it just off the API documentation. you may get close, but never perfect and you're going to experience constant breakage
simonw 3 hours ago [-]
Important to note that this is the approach taken by their AI research lab over the past six months, it's not (yet) reflective of how they build the core product.
andersmurphy 3 hours ago [-]
Right but how many unsuspecting customers like you do they need to have before they can exit?
From what I've heard the acquisition was unrelated to their AI lab work, it was about the core business.
andersmurphy 3 hours ago [-]
Thanks for the reply (always enjoy your sqlite content). It's definitely going to be interesting to see how all these AI labs play out once they're how the core business is built.
Herring 4 hours ago [-]
$100 says they're still doing leetcode interviews.
If everyone can do this, there won't be any advantage (or profit) to be had from it very soon. Why not buy your own hardware and run local models, I wonder.
navanchauhan 4 hours ago [-]
I would spend those $100 on either API tokens or donate to a charity of your choice. My interview to join this team was whether I could build something of my choosing in under an hour with any coding agent of my choice.
No local model out there is as good as the SOTA right now.
Herring 3 hours ago [-]
> My interview to join this team was whether I could build something of my choosing in under an hour with any coding agent of my choice.
You should have led with that. I think that's actually more impressive; anyone can spend tokens.
d0liver 4 hours ago [-]
> As I understood it the trick was effectively to dump the full public API documentation of one of those services into their agent harness and have it build an imitation of that API, as a self-contained Go binary. They could then have it build a simplified UI over the top to help complete the simulation.
This is still the same problem -- just pushed back a layer. Since the generated API is wrong, the QA outcomes will be wrong, too. Also, QAing things is an effective way to ensure that they work _after_ they've been reviewed by an engineer. A QA tester is not going to test for a vulnerability like a SQL injection unless they're guided by engineering judgement which comes from an understanding of the properties of the code under test.
The output is also essentially the definition of a derivative work, so it's probably not legally defensible (not that that's ever been a concern with LLMs).
galoisscobi 3 hours ago [-]
What has strongdm actually built? Are their users finding value from their supposed productivity gains?
If their focus is to only show their productivity/ai system but not having built anything meaningful with it, it feels like one of those scammy life coaches/productivity gurus that talk about how they got rich by selling their courses.
politelemon 3 hours ago [-]
> we transitioned from boolean definitions of success ("the test suite is green") to a probabilistic and empirical one. We use the term satisfaction to quantify this validation: of all the observed trajectories through all the scenarios, what fraction of them likely satisfy the user?
Oh, to have the luxury of redefining success and handwaving away hard learned lessons in the software industry.
hnthrow0287345 4 hours ago [-]
Yep, you definitely want to be in the business of selling shovels for the gold rush.
mellosouls 6 hours ago [-]
Having submitted this I would also suggest the website admin revisit their testing; it's very slow on my phone. Obviously fails on aesthetics and accessibility as well. Submitted for the essay.
pengaru 4 hours ago [-]
Sounds like you're experiencing an "agentic moment".
pityJuke 5 hours ago [-]
Haha yeah if I scroll on my iPhone 15 Pro it literally doesn’t load until I stop.
foolserrandboy 5 hours ago [-]
I get the following on safari on iOs: A problem repeatedly occurred on (url)
throwaway0123_5 4 hours ago [-]
On iOS Safari it loads and works decently for me, but with iOS Firefox and Firefox Focus it doesn't even load.
belter 3 hours ago [-]
Let's hope the agents in their factory can fix it asap...
wrs 4 hours ago [-]
On the cxdb “product” page one reason they give against rolling your own is that it would be “months of work”. Slipped into an archaic off-brand mindset there, no?
verdverm 3 hours ago [-]
We make this great, just don't use it to build the same thing we offer
Heat death of the SaaSiverse
eclipsetheworld 4 hours ago [-]
I have been working on my own "Digital Twins Universe" because 3rd-party SaaS tools often block the tight feedback loops required for long-horizon agentic coding. Unlike Stripe, which offers a full-featured environment usable in both development and staging, most B2B SaaS companies lack adequate fidelity (e.g., missing webhooks in local dev) or even a basic staging environment.
Taking the time to point a coding agent towards the public (or even private) API of a B2B SaaS app to generate a working (partial) clone is effectively "unblocking" the agent. I wouldn't be surprised if a "DTU-hub" eventually gains traction for publishing and sharing these digital twins.
I would love to hear more about your learnings from building these digital twins. How do you handle API drift? Also, how do you handle statefulness within the twins? Do you test for divergence? For example, do you compare responses from the live third-party service against the Digital Twin to check for parity?
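A first-pass parity check can be as simple as replaying the same request against both and diffing the JSON. A sketch (hypothetical endpoints, assuming reqwest's blocking client with its json feature, plus serde_json):

    use serde_json::Value;

    // Replay a request against the live vendor API and the local twin, then compare bodies.
    fn parity_check(path: &str) -> Result<bool, Box<dyn std::error::Error>> {
        let client = reqwest::blocking::Client::new();
        let live: Value = client
            .get(format!("https://api.vendor.example{path}"))
            .send()?
            .json()?;
        let twin: Value = client
            .get(format!("http://localhost:8080{path}"))
            .send()?
            .json()?;
        // serde_json::Value comparison is structural, so field order doesn't matter;
        // volatile fields (timestamps, request IDs) would need to be stripped first.
        Ok(live == twin)
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        for path in ["/v1/users/123", "/v1/projects?limit=5"] {
            println!("{path}: parity = {}", parity_check(path)?);
        }
        Ok(())
    }

Run over a sampled replay log, that gives you a drift signal over time instead of finding out only when the clone silently diverges.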
mccoyb 3 hours ago [-]
Effectively everyone is building the same tools with zero quantitative benchmarks or evidence behind the why / ideas … this entire space is a nightmare to navigate because of this. Who cares without proper science, seriously? I look through this website and it looks like a preview for a course I’m supposed to buy … when someone builds something with these sorts of claims attached, I assume that there is going to be some “real graphs” (“these are the number of times this model deviated from the spec before we added error correction …”)
What we have instead are many people creating hierarchies of concepts, a vast “naming” of their own experiences, without rigorous quantitative evaluation.
I may be alone in this, but it drives me nuts.
Okay, so with that in mind, it amounts to hearsay (“these guys are doing something cool”). Why not put up or shut up, with either (a) an evaluation of the ideas in a rigorous, quantitative way or (b) applying the ideas to produce a “hard” artifact (analogous, e.g., to the Anthropic C compiler, the Cursor browser) with a reproducible pathway to generation.
The answer seems to be that (b) is impossible (as long as we’re on the teat of the frontier labs, which disallow the kind of access that would make (b) possible) and the answer for (a) is “we can’t wait, we have to get our names out there first”
I’m disappointed to see these types of posts on HN. Where is the science?
simonw 2 hours ago [-]
Honestly I've not found a huge amount of value from the "science".
There are plenty of papers out there that look at LLM productivity and every one of them seems to have glaring methodology limitations and/or reports on models that are 12+ months out of date.
Have you seen any papers that really elevated your understanding of LLM productivity with real-world engineering teams?
mccoyb 2 hours ago [-]
No, I agree! But I don’t think that observation gives us license to avoid the problem.
Further, I’m not sure this elevates my understanding: I’ve read many posts in this space which could be viewed as analogous to this one (this one is more tempered, of course). Each one has this same flaw: someone is telling me I need to make an “organization” out of agents and positive things will follow.
Without a serious evaluation, how am I supposed to validate the author’s ontology?
Do you disagree with my assessment? Do you view the claims in this content as solid and reproducible?
My own view is that these are “soft ideas” (GasTown, Ralph fall into a similar category) without the rigorous justification.
What this amounts to is “synthetic biology” with billion dollar probability distributions — where the incentives are set up so that companies are incentivized to convey that they have the “secret sauce” … for massive amounts of money.
To that end, it’s difficult to trust a word out of anyone’s mouth — even if my empirical experiences match (along some projection).
simonw 1 hours ago [-]
The multi-agent "swarm" thing (that seems to be the term that's bubbling to the top at the moment) is so new and frothy that is difficult to determine how useful it actually is.
StrongDM's implementation is the most impressive I've seen myself, but it's also incredibly expensive. Is it worth the cost?
Cursor's FastRender experiment was also interesting but also expensive for what was achieved.
I think my favorite current example at the moment was Anthropic's $20,000 C compiler from the other day. But they're an AI vendor, demos from non-vendors carry more weight.
I've seen enough to be convinced that there's something there, but I'm also confident we aren't close to figuring out the optimal way of putting this stuff to work yet.
svara 1 hours ago [-]
The writing on this website is giving strong web3 vibes to me / doesn't smell right.
The only reason I'm not dismissing it out of hand is basically because you said this team was worth taking a look at.
I'm not looking for a huge amount of statistical ceremony, but some detail would go a long way here.
What exactly was achieved for what effort and how?
neya 3 hours ago [-]
The solution to this problem is not throwing everything at AI. To get good results from any AI model, you need an architect (human) instructing it from the top. And the logic behind this is that AI has been trained on millions of opinions on getting a particular task done. If you ask a human, they almost always have one opinionated approach for a given task. The human's opinion is a derivative of their lived experience, sometimes foreseeing all the way to an end result an AI cannot foresee. E.g. I want a database column to be a certain type because I'm thinking about adding an E-Commerce feature to my CMS later. An AI might not have this insight.
Of course, you can't always tell the model what to do, especially if it is a repeated task. It turns out, we already solved this decades ago using algorithms. Repeatable, reproducible, reliable. The challenge (and the reward) lies in separating the problem statement into algorithmic and agentic. Once you achieve this, the $1000 token usage is not needed at all.
I have a working prototype of the above and I'm currently productizing it (shameless plug):
However - I need to emphasize, the language you use to apply the pattern above matters. I use Elixir specifically for this, and it works really, really well.
It works based off starting with the architect. You. It feeds off specs and uses algorithms as much as possible to automate code generation (eg. Scaffolding) and only uses AI sparsely when needed.
Of course, the downside of this approach is that you can't just simply say "build me a social network". You can however say something like "Build me a social network where users can share photos, repost, like and comment on them".
Once you nail the models used in the MVC pattern and their relationships, the software design battle is pretty much 50% won. This is really good for v1 prototypes where you really want best practices enforced, OWASP-compliant code, and security-first software output, which is where a pure agentic/AI approach would mess up.
IT perspective here. Simon hits the nail on the head as to what I'm genuinely looking forward to:
> How do you clone the important parts of Okta, Jira, Slack and more? With coding agents!
This is what's going to gut-punch most SaaS companies repeatedly over the next decade, even if this whole build-out ultimately collapses in on itself (which I expect it to). The era of bespoke consultants for SaaS product suites to handle configuration and integrations, while not gone, are certainly under threat by LLMs that can ingest user requirements and produce functional code to do a similar thing at a fraction of the price.
What a lot of folks miss is that in enterprise-land, we only need the integration once. Once we have an integration, it basically exists with minimal if any changes until one side of the integration dies. Code fails a security audit? We can either spool up the agents again briefly to fix it, or just isolate it in a security domain like the glut of WinXP and Win7 boxen rotting out there on assembly lines and factory floors.
This is why SaaS stocks have been hammered this week. It's not that investors genuinely expect huge players to go bankrupt due to AI so much as they know the era of infinite growth is over. It's also why big AI companies are rushing IPOs even as data center builds stall: we're officially in a world where a locally-run model - not even an Agent, just a model in LM Studio on the Corporate Laptop - can produce sufficient code for a growing number of product integrations without any engineer having to look through yet another set of API documentation. As agentic orchestration trickles down to homelabs and private servers on smaller, leaner, and more efficient hardware, that capability is only going to increase, threatening profits of subscription models and large AI companies. Again, why bother ponying up for a recurring subscription after the work is completed?
For full-fledged software, there's genuine benefit to be had with human intervention and creativity; for the multitude of integrations and pipelines that were previously farmed out to pricey consultants, LLMs will more than suffice for all but the biggest or most complex situations.
theshrike79 3 hours ago [-]
“API Glue” is what I’ve called it since forever
Stuff comes in from an API goes out to a different API.
With a semi-decent agent I can build what took me a week or two in hours just because it can iterate the solution faster than any human can type.
A new field in the API could’ve been a two day ordeal of patching it through umpteen layers of enterprise frameworks. Now I can just tell Claude to add it, it’ll do it up to the database in minutes - and update the tests at the same time.
stego-tech 3 hours ago [-]
And because these are all APIs, we can brute-force it with read-only operations with minimal review times. If the read works, the write almost always will, and then it's just a matter of reading and documenting the integration before testing it in dev or staging.
So much of enterprise IT nowadays is spent hammering or needling vendors for basic API documentation so we can write a one-off that hooks DB1 into ServiceNow that's also pulling from NewRelic just to do ITAM. Consultants would salivate over such a basic integration because it'd be their yearly salary over a three month project.
Now we can do this ourselves with an LLM in a single sprint.
That's a Pandora's Box moment right there.
simianwords 3 hours ago [-]
I like the idea but I'm not so sure this problem can be solved generally.
As an example: imagine someone writing a data pipeline for training a machine learning model. Anyone who's done this knows that such a task involves lots of data wrangling work like cleaning data, changing columns and some ad hoc stuff.
The only way to verify that things work is if the eventual model that is trained performs well.
In this case, scenario testing doesn't scale up because the feedback loop is extremely large - you have to wait until the model is trained and tested on hold out data.
Scenario testing clearly can not work on the smaller parts of the work like data wrangling.
CubsFan1060 4 hours ago [-]
I can't tell if this is genius or terrifying given what their software does. Probably a bit of both.
I wonder what the security teams at companies that use StrongDM will think about this.
verdverm 3 hours ago [-]
I doubt this would be allowed in regulated industries like healthcare
svilen_dobrev 1 hours ago [-]
how about the elephant.. Apart from the business-spec itself, where are all those (supply-chain) API specs/documentation going to come from? Especially after, say, 3 iterations in this vein by the API-makers themselves??
chopete3 6 minutes ago [-]
"These go to 11"
navanchauhan 4 hours ago [-]
(I’m one of the people on this team). I joined fresh out of college, and it’s been a wild ride.
I’m happy to answer any questions!
steveklabnik 4 hours ago [-]
More of a comment than a question:
> Those of us building software factories must practice a deliberate naivete
This is a great way to put it, I've been saying "I wonder which sacred cows are going to need slaughtered" but for those that didn't grow up on a farm, maybe that metaphor isn't the best. I might steal yours.
This stuff is very interesting and I'm really interested to see how it goes for you, I'll eagerly read whatever you end up putting out about this. Good luck!
EDIT: oh also the re-implemented SaaS apps really recontextualizes some other stuff I’ve been doing too…
axus 4 hours ago [-]
> "I wonder which sacred cows are going to need slaughtered"
Or a vegan or Hindu. Which ethics are you willing to throw away to run the software factory?
I eat hamburgers while aware of the moral issues.
jessmartin 3 hours ago [-]
I’ve been building using a similar approach[1] and my intuition is that humans will be needed at some points in the factory line for specific tasks that require expertise/taste/quality. Have you found that to be the case? Where do you find that humans should be involved in the process for maximal leverage?
To name one probable area of involvement: how do you specify what needs to be built?
[1] https://sociotechnica.org/notebook/software-factory/
Your intuition/thinking definitely lines up with how we're thinking about this problem. If you have a good definition of done and a good validation harness, these agents can hill climb their way to a solution.
But you still need human taste/judgment to decide what you want to build (unless your solution is to just brute force the entire problem space).
For maximal leverage, you should follow the mantra "Why am I doing this?" If you use this enough times, you'll come across the bottleneck that can only be solved by you for now. As a human, your job is to set the higher-level requirements for what you're trying to build. Coming up with these requirements and then using agents to shape them up is acceptable, but human judgment is definitely where we have to answer what needs to be built. At the same time, I never want to be doing something the models are better at. Until we crack the proactiveness part, we'll be required to figure out what to do next.
Also, it looks like you and Danvers are working in the same space, and we love trading notes with other teams working in this area. We'd love to connect. You can either find my personal email or shoot me an email at my work email: navan.chauhan [at] strongdm.com
simonw 4 hours ago [-]
I know you're not supposed to look at the code, but do you have things in place to measure and improve code quality anyway?
Not just code review agents, but things like "find duplicated code and refactor it"?
navanchauhan 4 hours ago [-]
A few overnight “attractor” workflows serve distinct purposes:
* DRYing/Refactoring if needed
* Documentation compaction
* Security reviews
4 hours ago [-]
g947o 4 hours ago [-]
Serious question: what's keeping a competitor from doing the same thing and doing it better than you?
simonw 4 hours ago [-]
That's a genuine problem now. If you launch a new feature and your competition can ship their own copy a few hours later the competitive dynamics get really challenging!
My hunch is that the thing that's going to matter is network effects and other forms of soft lockin. Features alone won't cut it - you need to build something where value accumulates to your user over time in a way that discourages them from leaving.
CubsFan1060 4 hours ago [-]
The interesting part about that is both of those things require some sort of time to start.
If I launch a new product, and 4 hours later competitors pop up, then there's not enough time for network effects or lockin.
I'm guessing what is really going to be needed is something that can't be just copied. Non-public data, business contracts, something outside of software.
verdverm 3 hours ago [-]
Marketing and brand are still the most important, though I personally hope for a world where business is more indie and less winner take all
You can see the first waves of this trend in HN new.
andersmurphy 3 hours ago [-]
Wouldn't the incumbents with their fantastic distribution channels, brand, lockin, marketing, capital and own models just wipe the floor with everyone as talent no longer matters?
srcreigh 3 hours ago [-]
This is just sleight of hand.
In this model the spec/scenarios are the code. These are curated and managed by humans just like code.
They say "non interactive". But of course their work is interactive. AI agents take a few minutes to hours, whereas you can see a code change's result in seconds. That doesn't mean AI agents aren't interactive.
I'm very AI-positive, and what they're doing is different, but they are basically just lying. It's a new word for a new instance of the same old type of thing. It's not a new type of thing.
The common anti-AI trope is "AI just looked at <human output> to do this." The common pro-AI trope from the StrongDM camp is "look, the agent is working without human input." Both of these takes are fundamentally flawed.
AI will always depend on humans to produce relevant results for humans. It's not a flaw of AI, it's more of a flaw of humans. Consequently, "AI needs human input to produce results we want to see" should not detract from the intelligence of AI.
Why is this true? At a certain point you run into Kolmogorov complexity: with fixed memory, a fixed prompt size, and specific model weights, the pigeonhole principle means not every output can be produced, no matter the input.
Recursive self-improvement doesn't get around this problem. Where does it get the data for next iteration? From interactions with humans.
Given the infinite complexity of mathematics - solving Busy Beaver numbers, for instance - this is a proof that AI cannot, in fact, solve every problem. Humans seem to be limited in this regard as well, but there is no proof that humans are fundamentally limited this way like AI is. This lack of proof of the limitations of humans is the precise advantage in intelligence that humans will always have over AI.
rhrthg 4 hours ago [-]
Can you disclose the number of Substack subscriptions and whether there is an unusual amount of bulk subscriptions from certain entities?
simonw 4 hours ago [-]
I recently passed 40,000 but my Substack is free so it's not a revenue source for me. I haven't really looked at who they are - at some point it would be interesting to export the CSV of the subscribers and count by domains, I guess.
My content revenue comes from ads on my blog via https://www.ethicalads.io/ - rarely more than $1,000 in a given month - and sponsors on GitHub: https://github.com/sponsors/simonw - which is adding up to quite good money now. Those people get my sponsors-only monthly newsletter which looks like this: https://gist.github.com/simonw/13e595a236218afce002e9aeafd75... - it's effectively the edited highlights from my blog because a lot of people are too busy to read everything I put out there!
I try to keep my disclosures updated on the about page of my blog: https://simonwillison.net/about/#disclosures
So much of this resonated with me, and I realize I’ve arrived at a few of the techniques myself (and with my team) over the last several months.
THIS FRIGHTENS ME. Many of us swengs are either going to be FIRE millionaires, or living under a bridge, in two years.
I’ve spent this week performing SemPort; found a ts app that does a needed thing, and was able to use a long chain of prompts to get it completely reimplemented in our stack, using Gene Transfer to ensure it uses some existing libraries and concrete techniques present in our existing apps.
Now not only do I have an idiomatic Python port, which I can drop right into our stack, but I have an extremely detailed features/requirements statement for the original TypeScript app along with the prompts for generating it. I can use this to continuously track this other product as it improves. I also have the “instructions infrastructure” to direct an agent to align new code to our stack. Two reusable skills, a new product, and it took a week.
cbeach 3 hours ago [-]
Please let’s not call ourselves “swengs”
Is it really that hard to write “developer” or “engineer”?
beepbooptheory 3 hours ago [-]
Sorry if rude but truly feel like I am missing the joke. This is just LinkedIn copypasta or something right?
threecheese 3 hours ago [-]
My post? Shiiiii if that’s how it comes across I may delete it. I haven’t logged into LI since our last corp reorg, it was a cesspool even then. Self promotion just ain’t my bag
I was just trying to share the same patterns from OP's documentation that I found valuable within the context of agentic development; seeing them take this so far is what scares me, because they are right that I could wire an agent to do this autonomously and probably get the same outcomes, scaled.
Thanks. I’m unable to find the term “domain model” on the website.
navanchauhan 3 hours ago [-]
It’s part of the “lore” that gets passed down when you join the company.
Funnily enough, the marketing department even ran a campaign asking, “What does DM stand for?!”, and the answer was “Digital Metropolis,” because we did a design refresh.
I just linked the website because that’s what the actual company does, and we are just the “AI Lab”
dude250711 3 hours ago [-]
Doomy marketing?
dist-epoch 3 hours ago [-]
Gas Town, but make it Enterprise.
AlexeyBrin 3 hours ago [-]
Code must not be written by humans
Code must not be reviewed by humans
I feel like I'm taking crazy pills. I would avoid this company like the plague.
The only github I could find is: https://github.com/strongdm/attractor
Edit:
I did find some code. Commit history has been squashed unfortunately: https://github.com/strongdm/cxdb
There's a bunch more under the same org but it's years old.
- Their HTTP handler has an 800 line long closure which they immediately call, apparently as a substitute for the still unstable try_blocks feature. I would strongly recommend moving that into its own full function instead.
- Several ifs that should have been matches.
- Calls to Result::unwrap and Option::unwrap. IMO in production code you should always at minimum use expect instead, forcing you to explain what went wrong/why the Err/None case is impossible.
It wouldn't catch all/most of these (and from what I've seen might even induce some if agents continue to pursue the most local fix rather than removing the underlying cause), but I would strongly recommend turning on most of clippy's lints if you want to learn rust.
[0] https://rust-unofficial.github.io/patterns/anti_patterns/bor...
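For a concrete sense of the shape of those fixes, a minimal sketch (the types and names are hypothetical, not their actual code):

    // Hypothetical example: one variant per failure case instead of one big string.
    #[allow(dead_code)]
    #[derive(Debug)]
    enum StoreError {
        NotFound,
        Io(std::io::Error),
    }

    // Instead of an 800-line immediately-invoked closure standing in for try blocks,
    // pull the fallible body into its own function that returns a Result.
    fn handle_get(id: &str, store: &std::collections::HashMap<String, String>) -> Result<String, StoreError> {
        // Prefer `match` over a chain of `if`s when branching on enum-like state.
        match store.get(id) {
            Some(value) => Ok(value.clone()),
            None => Err(StoreError::NotFound),
        }
    }

    fn main() {
        let mut store = std::collections::HashMap::new();
        store.insert("42".to_string(), "hello".to_string());

        // `expect` over `unwrap`: the message records why the Err/None case "can't" happen.
        let value = handle_get("42", &store)
            .expect("key 42 was inserted two lines above, so lookup cannot fail");
        println!("{value}");
    }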
For those of us working on building factories, this is pretty obvious, because you immediately need shared context across agents / sessions and an improved ID + permissions system to keep track of who is doing what.
PS: TIL about "Canadian girlfriend", thanks!
The worst part is they got simonw to (perhaps unwittingly, or through social engineering) vouch for and stealth-market for them.
And $1000/day/engineer in token costs at current market rates? It's a bold strategy, Cotton.
But we all know what they're going for here. They want to make themselves look amazing to convince the boards of the Great Houses to acquire them. Because why else would investors invest in them and not in the Great Houses directly.
(Two people whose opinions I respect said "yeah you really should accept that invitation" otherwise I probably wouldn't have gone.)
I've been looking forward to being able to write more details about what they're doing ever since.
EDIT nvm just saw your other comment.
We’ve been working on this since July, and we shared the techniques and principles that have been working for us because we thought others might find them useful. We’ve also open-sourced the nlspec so people can build their own versions of the software factory.
We’re not selling a product or service here. This also isn’t about positioning for an acquisition: we’ve already been in a definitive agreement to be acquired since last month.
It’s completely fair to have opinions and to not like what we’re putting out, but your comment reads as snarky without adding anything to the conversation.
There are higher and lower leverage ways to do that - for instance, reviewing tests and QA'ing the software by using it versus reading the original code - but you can't get away from doing it entirely.
What I’m working on (open source) is less about replacing human validation and more about scaling it: using multiple independent agents with explicit incentives and disagreement surfaced, instead of trusting a single model or a single reviewer.
Humans are still the final authority, but consensus, adversarial review, and traceable decision paths let you reserve human attention for the edge cases that actually matter, rather than reading code or outputs linearly.
Until we treat validation as a first-class system problem (not a vibe check on one model’s answer), most of this will stay in “cool demo” territory.
We’ve spent years systematizing generation, testing, and deployment. Validation largely hasn’t changed, even as the surface area has exploded. My interest is in making that human effort composable and inspectable, not pretending it can be eliminated.
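As a rough illustration of what I mean by surfacing disagreement (a minimal sketch with made-up names, not the actual project): several independent reviewers vote, unanimity auto-resolves, and splits escalate to a person along with the rationale trail.

    #[derive(Debug, PartialEq, Clone, Copy)]
    enum Verdict {
        Approve,
        Reject,
    }

    struct Review {
        reviewer: &'static str,
        verdict: Verdict,
        rationale: &'static str, // kept so the decision path stays traceable
    }

    enum Outcome {
        AutoAccepted,
        AutoRejected,
        EscalateToHuman(Vec<Review>),
    }

    fn resolve(reviews: Vec<Review>) -> Outcome {
        let total = reviews.len();
        let approvals = reviews.iter().filter(|r| r.verdict == Verdict::Approve).count();
        match approvals {
            n if n == total => Outcome::AutoAccepted,
            0 => Outcome::AutoRejected,
            // Disagreement, not agreement, is the signal worth human attention.
            _ => Outcome::EscalateToHuman(reviews),
        }
    }

    fn main() {
        let reviews = vec![
            Review { reviewer: "builder-agent", verdict: Verdict::Approve, rationale: "all scenarios pass" },
            Review { reviewer: "adversarial-agent", verdict: Verdict::Reject, rationale: "assert-true style test in auth module" },
        ];
        if let Outcome::EscalateToHuman(rs) = resolve(reviews) {
            for r in rs {
                println!("{}: {:?} ({})", r.reviewer, r.verdict, r.rationale);
            }
        }
    }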
And “define the spec concretely“ (and how to exploit emerging behaviors) becomes the new definition of what programming is.
(and unambiguously. and completely. For various depths of those)
This always has been the crux of programming. Just has been drowned in closer-to-the-machine more-deterministic verbosities, be it assembly, C, prolog, js, python, html, what-have-you
There have been never-ending attempts to reduce that to a more away-from-machine representation. Low-code/no-code (anyone remember Last-one for Apple ][ ?), interpreting-and/or-generating-off DSLs of various levels of abstraction, further to esperanto-like artificial reduced-ambiguity languages... some even English-like..
For some domains, the above worked/works - and the (business) analysts became the new programmers. Some companies have such internal languages. For most others, not really. And not that long ago, the SW-Engineer job was called Analyst-programmer.
But still, the frontier is there to cross..
>StrongDM’s answer was inspired by Scenario testing (Cem Kaner, 2003).
> You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.
You don't need a human who knows the system to validate it if you trust the LLM to do the scenario testing correctly. And from my experience, it is very trustable in these aspects.
Can you detail a scenario by which an LLM can get the scenario wrong?
You and I disagree on this specific point.
Edit: I find your comment a bit distasteful. If you can provide a scenario where it can get it incorrect, that’s a good discussion point. I don’t see many places where LLMs can’t verify as well as humans. If I developed a new piece of business logic - say, users from country X should not be able to use this feature - an LLM can very easily verify this by generating its own sample API call and checking the response.
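For instance, that country-X check boils down to something like this (a minimal sketch - the endpoint and header are hypothetical):

    // Requires reqwest = { version = "0.12", features = ["blocking"] } in Cargo.toml.
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let client = reqwest::blocking::Client::new();

        // Generate a sample API call as if the caller were in the blocked country.
        let resp = client
            .post("https://staging.example.com/api/v1/feature-x")
            .header("X-Geo-Country", "XX") // pretend geolocation resolves to "country X"
            .send()?;

        // Business rule under test: country X must not be able to use this feature.
        assert_eq!(
            resp.status(),
            reqwest::StatusCode::FORBIDDEN,
            "expected feature-x to be blocked for country X, got {}",
            resp.status()
        );
        println!("scenario passed: country X is blocked");
        Ok(())
    }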
You can't 100% trust a human either.
But, as with self-driving, the LLM simply needs to be better. It does not need to be perfect.
Then it seems like the only workable solution from your perspective is a solo member team working on a product they came up with. Because as soon as there's more than one person on something, they have to use "lossy natural language" to communicate it between themselves.
On the plus side, IMO nonverbal cues make it way easier to tell when a human doesn't understand something than when an agent doesn't.
At that point, outside of FAANG and their salaries, you are spending more on AI than you are on your humans. And they consider that level of spend to be a metric in and of itself. I'm kinda shocked the rest of the article just glossed over that one. It seems to be a breakdown of the entire vision of AI-driven coding. I mean, sure, the vendors would love it if everyone's salary budget just got shifted over to their revenue, but such a world is absolutely not my goal.
Edit: here's that section: https://simonwillison.net/2026/Feb/7/software-factory/#wait-...
Assuming 20 working days a month: that's 20k x 12 == 240k a year. So about a fresh grad's TC at FANG.
Now I've worked with many junior to mid-junior level SDEs and sadly 80% do not do a better job than Claude. (I've also worked with staff level SDEs who write worse code than AI, but they usually offset that with domain knowledge and TL responsibilities.)
I do see AI transforming software engineering into even more of a pyramid with very few humans on top.
> At that point, outside of FAANG and their salaries, you are spending more on AI than you are on your humans
You say
> Assuming 20 working days a month: that's 20k x 12 == 240k a year. So about a fresh grad's TC at FANG.
So you both are in agreement on that part at least.
And it might be the tokens will become cheaper.
Future, better models will demand both more compute AND more energy. We should not underestimate the slowness of energy production growth, or the supply constraints involved in simply hooking things up. Some labs are commissioning their own power plants on site, but that doesn't get around the power grid's growth limits - you're using the same supply chain to build your own power plant.
If inference cost is not dramatically reduced and models don't start meaningfully helping with innovations that make energy production faster and inference/training less power-hungry, the only way to control demand is to raise prices. Current inference pricing does not cover training costs. They can probably continue to cover that on funding alone, but once the demand curve hits the power production limits, only one thing can slow demand, and that's raising the cost of use.
which sounds more like if you haven't reached this point you don't have enough experience yet, keep going
At least that's how I read the quote
Apart from being an absolutely ridiculous metric, this is a bad approach, at least with current generation models. In my experience, the less you inspect what the model does, the more spaghetti-like the code will be. And the flying spaghetti monster eats tokens faster than you can blink! Or put more clearly: implementing a feature will cost you a lot more tokens in a messy code base than it does in a clean one. It's not (yet) enough to just tell the agent to refactor and make it clean, you have to give it hints on how to organise the code.
I'd go so far as to say that if you're burning a thousand dollars a day per engineer, you're getting very little bang for your tokens.
And your engineers probably look like this: https://share.google/H5BFJ6guF4UhvXMQ7
I wrote a bunch more about that this morning: https://simonwillison.net/2026/Feb/7/software-factory/
This one is worth paying attention to. They're the most ambitious team I've seen exploring the limits of what you can do with this stuff. It's eye-opening.
> If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement
Seems to me like if this is true I'm screwed no matter if I want to "embrace" the "AI revolution" or not. No way my manager's going to approve me to blow $1000 a day on tokens, they budgeted $40,000 for our team to explore AI for the entire year.
Let alone from a personal perspective I'm screwed because I don't have $1000 a month in the budget to blow on tokens because of pesky things that also demand financial resources like a mortgage and food.
At this point it seems like damned if I do, damned if I don't. Feels bad man.
I don't think you need to spend anything like that amount of money to get the majority of the value they're describing here.
Edit: added a new section to my blog post about this: https://simonwillison.net/2026/Feb/7/software-factory/#wait-...
I would expect cost to come down over time, using approaches pioneered in the field of manufacturing.
I built a tool that writes (non shit) reports from unstructured data to be used internally by analysts at a trading firm.
It cost between $500 and $5,000 per day per seat to run.
It could have cost a lot more but latency matters in market reports in a way it doesn't for software. I imagine they are burning $1000 per day per seat because they can't afford more.
Another skill called skill-improver, which tries to reduce skill token usage by finding deterministic patterns in another skill that can be scripted, and writes and packages the script.
Putting them together, the container-maintenance thingy improves itself every iteration, validated with automatic testing. It works perfectly about 3/4 of the time, another half of the time it kinda works, and fails spectacularly the rest.
It’s only going to get better, and this fit within my Max plan usage while coding other stuff.
If the tokens that need to attend to each other are on opposite ends of the code base the only way to do that is by reading in the whole code base and hoping for the best.
If you're very lucky you can chunk the code base in such a way that the chunks pairwise fit in your context window and you can extract the relevant tokens hierarchically.
If you're not. Well get reading monkey.
Agents, md files, etc. are bandaids to hide this fact. They work great until they don't.
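A crude sketch of the hierarchical version (the token counting and relevance scoring here are naive stand-ins for whatever an agent actually uses): chunk the code base, keep only the chunks most relevant to the task, and repeat until it fits the window.

    const WINDOW_TOKENS: usize = 8_000;

    fn approx_tokens(s: &str) -> usize {
        s.len() / 4 // rough heuristic: ~4 characters per token
    }

    fn relevance(chunk: &str, query: &str) -> usize {
        query.split_whitespace().filter(|w| chunk.contains(w)).count()
    }

    fn select_context(mut chunks: Vec<String>, query: &str) -> String {
        loop {
            let total: usize = chunks.iter().map(|c| approx_tokens(c)).sum();
            if total <= WINDOW_TOKENS || chunks.len() <= 1 {
                // Either it fits, or there is nothing left to drop
                // (a single oversized chunk would need finer-grained splitting).
                return chunks.join("\n");
            }
            // Drop the least relevant half and try again.
            chunks.sort_by_key(|c| std::cmp::Reverse(relevance(c, query)));
            chunks.truncate((chunks.len() + 1) / 2);
        }
    }

    fn main() {
        let files = vec![
            "fn parse_config() { /* ... */ }".to_string(),
            "fn handle_payment() { /* ... */ }".to_string(),
        ];
        let ctx = select_context(files, "payment refund flow");
        println!("{} tokens of context selected", approx_tokens(&ctx));
    }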
To be fair, I’ll bet many of the people embracing concerning advice like that have never worked for the same company for a full year.
As for me, we get Cursor seats at work, and at home I have a GPU, a cheap Chinese coding plan, and a dream.
Right in the feels
Make a "systemctl start tokenspender.service" and share it with the team?
I didn't read that as you need to be spending $1k/day per engineer. That is an insane number.
EDIT: re-reading... it's ambiguous to me. But perhaps they mean per day, every day. This will only hasten the elimination of human developers, which I presume is the point.
At home on my personal setup, I haven't even had to move past the cheapest codex/claude code subscription because it fulfills my needs ¯\_(ツ)_/¯. You can also get a lot of mileage out of the higher tiers of these subscriptions before you need to start paying the APIs directly.
Takes like this are just baffling to me.
For one engineer that is ~260k a year.
The thing with AI is that it ranges from net-negative to easily brute forcing tedious things that we never have considered wasting human time on. We can't figure out where the leverage is unless all the subject matter experts in their various organizational niches really check their assumptions and get creative about experimenting and just trying different things that may never have crossed their mind before. Obviously over time best practices will emerge and get socialized, but with the rate that AI has been improving lately, it makes a lot of sense to just give employees carte blanche to explore. Soon enough there will be more scrutiny and optimization, but that doesn't really make sense without a better understanding of what is possible.
1) Engineering investment at companies generally pays off in multiples of what is spent on engineering time. Say you pay 10 engineers $200k / year each and the features those 10 engineers build grow yearly revenue by $10M. That’s a 4x ROI and clearly a good deal. (Of course, this only applies up to some ceiling; not every company has enough TAM to grow as big as Amazon).
2) Giving engineers near-unlimited access to token usage means they can create even more features, in a way that still produces positive ROI per token. This is the part I disagree with most. It’s complicated. You cannot just ship infinite slop and make money. It glosses over massive complexity in how software is delivered and used.
3) Therefore (so the argument goes) you should not cap tokens and should encourage engineers to use as many as possible.
Like I said, I don’t agree with this argument. But the key thing here is step 1. Engineering time is an investment to grow revenue. If you really could get positive ROI per token in revenue growth, you should buy infinite tokens until you hit the ceiling of your business.
Of course, the real world does not work like this.
But my point is moreso that saying 1k a day is cheap is ridiculous. Even for a company that expects an ROI on that investment. There’s risks involved and as you said, diminishing returns on software output.
I find AI bros view of the economics of AI usage strange. It’s reasonable to me to say you think its a good investment, but to say it’s cheap is a whole different thing.
The best you can say is “high cost but positive ROI investment.” Although I don’t think that’s true beyond a certain point either, certainly not outside special cases like small startups with a lot of funding trying to build a product quickly. You can’t just spew tokens about and expect revenue to increase.
That said, I do reserve some special scorn for companies that penny-pinch on AI tooling. Any CTO or CEO who thinks a $200/month Claude Max subscription (or equivalent) for each developer is too much money to spend really needs to rethink their whole model of software ROI and costs. You’re often paying your devs >$100k/yr and you won’t pay $2k/yr to make them more productive? I understand there are budget and planning cycle constraints blah blah, but… really?!
The moats here are around mechanism design and values (to the extent they differ): the frontier labs are doomed in this world, the commons locked up behind paywalls gets hyper mirrored, value accrues in very different places, and it's not a nice orderly exponent from a sci-fi novel. It's nothing like what the talking heads at Davos say, Anthropic aren't in the top five groups I know in terms of being good at it, it'll get written off as fringe until one day it happens in like a day. So why be secretive?
You get on the ladder by throwing out Python and JSON and learning lean4, you tie property tests to lean theorems via FFI when you have to, you start building out rfl to pretty printers of proven AST properties.
And yeah, the droids run out ahead in little firecracker VMs reading from an effect/coeffect attestation graph and writing back to it. The result is saved, useful results are indexed. Human review is about big picture stuff, human coding is about airtight correctness (and fixing it when it breaks despite your "proof" that had a bug in the axioms).
Programming jobs are impacted but not as much as people think: droids do what David Graeber called bullshit jobs for the most part and then they're savants (not polymath geniuses) at a few things: reverse engineering and infosec they'll just run you over, they're fucking going in CIC.
This is about formal methods just as much as AI.
Their page looks to me like a lot of invented jargon and pure narrative. Every technique is just a renamed existing concept. Digital Twin Universe is mocks, Gene Transfusion is reading reference code, Semport is transpilation. The site has zero benchmarks, zero defect rates, zero cost comparisons, zero production outcomes. The only metric offered is "spend more money".
Anyone working honestly in this space knows 90% of agent projects are failing.
The main page of HN now has three to four posts daily with no substance, just Agentic AI marketing dressed as engineering insight.
With Google, Microsoft, and others spending $600 billion over the next year on AI, and panicking to get a return on that Capex....and with them now paying influencers over $600K [1] to manufacture AI enthusiasm to justify this infrastructure spend, I won't engage with any AI thought leadership that lacks a clear disclosure of financial interests and reproducible claims backed by actual data.
Show me a real production feature built entirely by agents with full traces, defect rates, and honest failure accounting. Or stop inventing vocabulary and posting vibes charts.
[1] - https://news.ycombinator.com/item?id=46925821
Repeating for emphasis, because this is the VERY obvious question anyone with a shred of curiosity would be asking not just about this submission but about what is CONSTANTLY on the frontpage these days.
There could be a very simple 5 question questionnaire that could eliminate 90+% of AI coding requests before they start:
- Is this a small wrapper around just querying an existing LLM?
- Does a brief summary of this searched with "site:github" already return dozens or hundreds of results?
- Is this a classic scam (pump&dump, etc) redone using "AI"?
- Is this needless churn between already high level abstractions of technology (dashboard of dashboards, yaml to json, python to javascript, automation of automation framework)?
I will reformulate my question to ask instead: is the page still 100% correct, or does it need an update?
However I would argue there are significant gaps:
- You do not name your consulting clients. You admit to doing ad-hoc consulting and training for unnamed companies while writing daily about AI products. Those client names are material information.
- You receive non-cash payments that have monetary value. Free API credits, weeks of early preview access, flights, hotels, dinners, and event invitations are all compensation. Do you keep those credits?
- The statement "I have not accepted payments from LLM vendors" could still be compatible with receiving things worth thousands of dollars. Please note I am not saying you did.
- You have a structural conflict. Favorable coverage will mean preview access, then exclusive content, then traffic, then sponsors, then consulting clients.
- You appeared in an OpenAI promotional video for GPT-5 and were paid for it. This is influencer marketing by any definition.
- Your quotes are used as third-party validation in press coverage of AI product launches. This is a PR function with commercial value to these companies.
The FTC's revised Endorsement Guides explicitly apply to bloggers, not just social media influencers. The FTC defines a material connection to include not only cash payments but also free products, early access to a product, event invitations, and appearing in promotional media, all of which would seem to apply here.
The FTC's own "Disclosures 101" guide also states [2]: "...Disclosures are likely to be missed if they appear only on an ABOUT ME or profile page, at the end of posts or videos, or anywhere that requires a person to click MORE."
https://www.ftc.gov/business-guidance/resources/disclosures-...
https://www.ftc.gov/system/files/documents/plain-language/10...
I would argue an ecosystem of free access, preview privileges, promotional video appearances, API credits, and undisclosed consulting does constitute a financial relationship that should be more transparently disclosed than "I have not accepted payments from LLM vendors."
This is one of the clearest takes I've seen that starts to get me to the point of possibly being able to trust code that I haven't reviewed.
The whole idea of letting an AI write tests was problematic because they're so focused on "success" that `assert True` becomes appealing. But orchestrating teams of agents that are incentivized to build, and teams of agents that are incentivized to find bugs and problematic tests, is fascinating.
I'm quite curious to see where this goes, and more motivated (and curious) than ever to start setting up my own agents.
Question for people who are already doing this: How much are you spending on tokens?
That line about spending $1,000 on tokens is pretty off-putting. For commercial teams it's an easy calculation. It's also depressing to think about what this means for open source. I sure can't afford to spend $1,000 supporting teams of agents to continue my open source work.
Check it: https://news.ycombinator.com/item?id=46838946
I don't take your comment as dismissive, but I think a lot of people are dismissing interesting and possibly effective approaches with short reactions like this.
I'm interested in the approach described in this article because it's specifying where the humans are in all this, it's not about removing humans entirely. I can see a class of problems where any non-determinism is completely unacceptable. But I can also see a large number of problems where a small amount of non-determinism is quite acceptable.
They can communicate through the source code. Also Schelling points.
Something like "approve this PR and I will generate some easy bugs for you to find later"
I think people are burning money on tokens letting these things fumble about until they arrive at some working set of files.
I'm staying in the loop more than this, building up rather than tuning out
as a previous strongDM customer, i will never recommend their offering again. for a core security product, this is not the flex they think it is
also mimicking other products' behavior and staying in sync is a fool's task. you certainly won't be able to do it just off the API documentation. you may get close, but never perfect, and you're going to experience constant breakage
From what I've heard the acquisition was unrelated to their AI lab work, it was about the core business.
If everyone can do this, there won't be any advantage (or profit) to be had from it very soon. Why not buy your own hardware and run local models, I wonder.
No local model out there is as good as the SOTA right now.
You should have led with that. I think that's actually more impressive; anyone can spend tokens.
This is still the same problem -- just pushed back a layer. Since the generated API is wrong, the QA outcomes will be wrong, too. Also, QAing things is an effective way to ensure that they work _after_ they've been reviewed by an engineer. A QA tester is not going to test for a vulnerability like a SQL injection unless they're guided by engineering judgement which comes from an understanding of the properties of the code under test.
The output is also essentially the definition of a derivative work, so it's probably not legally defensible (not that that's ever been a concern with LLMs).
If their focus is only to show off their productivity/AI system without having built anything meaningful with it, it feels like one of those scammy life coaches/productivity gurus who talk about how they got rich by selling their courses.
Oh, to have the luxury of redefining success and handwaving away hard learned lessons in the software industry.
Heat death of the SaaSiverse
Taking the time to point a coding agent towards the public (or even private) API of a B2B SaaS app to generate a working (partial) clone is effectively "unblocking" the agent. I wouldn't be surprised if a "DTU-hub" eventually gains traction for publishing and sharing these digital twins.
I would love to hear more about your learnings from building these digital twins. How do you handle API drift? Also, how do you handle statefulness within the twins? Do you test for divergence? For example, do you compare responses from the live third-party service against the Digital Twin to check for parity?
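Concretely, the parity check I have in mind would be something along these lines (a minimal sketch with hypothetical URLs, not how they actually do it): replay the same request against the live service and the twin, then diff the JSON.

    // Requires reqwest = { version = "0.12", features = ["blocking", "json"] } and serde_json.
    use serde_json::Value;

    fn fetch(client: &reqwest::blocking::Client, base: &str, path: &str) -> Result<Value, Box<dyn std::error::Error>> {
        Ok(client.get(format!("{base}{path}")).send()?.error_for_status()?.json()?)
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let client = reqwest::blocking::Client::new();
        let path = "/api/v1/users/42";

        let live = fetch(&client, "https://live.example.com", path)?;
        let twin = fetch(&client, "http://localhost:8080", path)?;

        // Compare field by field so the report says *where* drift happened, not just that it happened.
        match (live.as_object(), twin.as_object()) {
            (Some(live_obj), Some(twin_obj)) => {
                for (key, live_val) in live_obj {
                    match twin_obj.get(key) {
                        Some(twin_val) if twin_val == live_val => {}
                        Some(twin_val) => println!("divergence at `{key}`: live={live_val} twin={twin_val}"),
                        None => println!("twin is missing field `{key}`"),
                    }
                }
            }
            _ if live != twin => println!("top-level responses diverge"),
            _ => {}
        }
        Ok(())
    }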
What we have instead are many people creating hierarchies of concepts, a vast “naming” of their own experiences, without rigorous quantitative evaluation.
I may be alone in this, but it drives me nuts.
Okay, so with that in mind, it amounts to hearsay: “these guys are doing something cool.” Why not put up or shut up, with either (a) an evaluation of the ideas in a rigorous, quantitative way or (b) applying the ideas to produce a “hard” artifact (analogous, e.g., to the Anthropic C compiler or the Cursor browser) with a reproducible pathway to generation?
The answer seems to be that (b) is impossible (as long as we’re on the teat of the frontier labs, which disallow the kind of access that would make (b) possible) and the answer for (a) is “we can’t wait, we have to get our names out there first”
I’m disappointed to see these types of posts on HN. Where is the science?
There are plenty of papers out there that look at LLM productivity and every one of them seems to have glaring methodology limitations and/or reports on models that are 12+ months out of date.
Have you seen any papers that really elevated your understanding of LLM productivity with real-world engineering teams?
Further, I’m not sure this elevates my understanding: I’ve read many posts in this space which could be viewed as analogous to this one (this one is more tempered, of course). Each one has the same flaw: someone is telling me I need to make an “organization” out of agents and positive things will follow.
Without a serious evaluation, how am I supposed to validate the author’s ontology?
Do you disagree with my assessment? Do you view the claims in this content as solid and reproducible?
My own view is that these are “soft ideas” (GasTown, Ralph fall into a similar category) without the rigorous justification.
What this amounts to is “synthetic biology” with billion-dollar probability distributions, where the incentives are set up so that companies are rewarded for conveying that they have the “secret sauce” … for massive amounts of money.
To that end, it’s difficult to trust a word out of anyone’s mouth — even if my empirical experiences match (along some projection).
StrongDM's implementation is the most impressive I've seen myself, but it's also incredibly expensive. Is it worth the cost?
Cursor's FastRender experiment was also interesting but also expensive for what was achieved.
I think my favorite current example at the moment was Anthropic's $20,000 C compiler from the other day. But they're an AI vendor, demos from non-vendors carry more weight.
I've seen enough to be convinced that there's something there, but I'm also confident we aren't close to figuring out the optimal way of putting this stuff to work yet.
The only reason I'm not dismissing it out of hand is basically because you said this team was worth taking a look at.
I'm not looking for a huge amount of statistical ceremony, but some detail would go a long way here.
What exactly was achieved for what effort and how?
Of course, you can't always tell the model what to do, especially if it is a repeated task. It turns out, we already solved this decades ago using algorithms. Repeatable, reproducible, reliable. The challenge (and the reward) lies in separating the problem statement into algorithmic and agentic. Once you achieve this, the $1000 token usage is not needed at all.
I have a working prototype of the above and I'm currently productizing it (shameless plug):
https://designflo.ai
However - I need to emphasize, the language you use to apply the pattern above matters. I use Elixir specifically for this, and it works really, really well.
It works based off starting with the architect. You. It feeds off specs and uses algorithms as much as possible to automate code generation (eg. Scaffolding) and only uses AI sparsely when needed.
Of course, the downside of this approach is that you can't just simply say "build me a social network". You can however say something like "Build me a social network where users can share photos, repost, like and comment on them".
Once you nail the models used in the MVC pattern and their relationships, the software design battle is pretty much 50% won. This is really good for v1 prototypes where you really want best practices enforced, OWASP-compliant code, and security-first software output - which is where a pure agentic/AI approach would mess up.
> How do you clone the important parts of Okta, Jira, Slack and more? With coding agents!
This is what's going to gut-punch most SaaS companies repeatedly over the next decade, even if this whole build-out ultimately collapses in on itself (which I expect it to). The era of bespoke consultants for SaaS product suites handling configuration and integrations, while not gone, is certainly under threat from LLMs that can ingest user requirements and produce functional code to do a similar thing at a fraction of the price.
What a lot of folks miss is that in enterprise-land, we only need the integration once. Once we have an integration, it basically exists with minimal if any changes until one side of the integration dies. Code fails a security audit? We can either spool up the agents again briefly to fix it, or just isolate it in a security domain like the glut of WinXP and Win7 boxen rotting out there on assembly lines and factory floors.
This is why SaaS stocks have been hammered this week. It's not that investors genuinely expect huge players to go bankrupt due to AI so much as they know the era of infinite growth is over. It's also why big AI companies are rushing IPOs even as data center builds stall: we're officially in a world where a locally-run model - not even an Agent, just a model in LM Studio on the Corporate Laptop - can produce sufficient code for a growing number of product integrations without any engineer having to look through yet another set of API documentation. As agentic orchestration trickles down to homelabs and private servers on smaller, leaner, and more efficient hardware, that capability is only going to increase, threatening profits of subscription models and large AI companies. Again, why bother ponying up for a recurring subscription after the work is completed?