This is a naïve approach, not just because it uses FizzBuzz, but because it ignores the fundamental complexity of software as a system of abstractions. Testing often involves understanding these abstractions and testing for/against them.
For those of us with decades of experience who use coding agents for hours per day, the lesson has been that even with extensive context engineering these models do not magically cover more than about half of the testing space.
If you asked your coding agent to develop a memory allocator, it would not also 'automatically verify' that allocator against all of its failure modes. It is your responsibility as an engineer to bring long-term learning and regular contact with the real world to bear on the testing approach.
spaceywilly 1 day ago [-]
Exactly. The challenge isn’t getting the LLMs to make sure they validate their own code. It’s getting the LLMs to write the correct code in the first place. Adding more and more LLM-generated test code just obfuscates the LLM code even further. I have seen some really wild things where an LLM jumps through hoops to get tests to pass, even when they should actually be failing because the logic is wrong.
The core of the issue is that LLMs are sycophants: they want to make the user happy above all. The most important thing is to make sure what you are asking the LLM to do is correct from the beginning. I’ve found the highest-value activity is in the planning phase.
When I have gotten good results with Claude Code, it’s because I spent a lot of time working with it to generate a detailed plan of what I wanted to build. Then by the time it got to the coding step, actually writing the code is trivial because the details have all been worked out in the plan.
It’s probably not a coincidence that when I have worked in safety critical software (DO-178), the process looks very similar. By the time you write a line of code, the requirements for that line have been so thoroughly vetted that writing the code feels like an afterthought.
bisonbear 17 hours ago [-]
I'm becoming convinced that test pass rate is not a great indicator of model quality - instead we have to look at agent behavior beyond the test gate, such as how aligned it is with human intent and whether it follows the repo's coding standards.
also +1 on placing heavy emphasis on the plan. if you have a good plan, then the code becomes trivial. I have started doing a 70/30 or even 80/20 split of time spent on plan / time implementing & reviewing
mvrckhckr 22 hours ago [-]
The best way I can describe the approach I take is having the ability to "smell" what the AI might have gotten wrong (or forgotten completely).
It happens all the time, even when I only scan the code or simply run it and use it. It's uncanny how many such "smells" I find even in the most trivial applications. Sometimes the model's replies in Codex or Claude Code are enough to trigger it.
These are mistakes only a very (very) inexperienced developer would make.
seanmcdirmid 19 hours ago [-]
If you wrote a spec for a memory allocator and asked the AI to identify edge cases and points that need to be tested first, it could work (I never asked AI to do that, but it works for other problems I’ve done). Yes, but you can’t feed in a garbage prompt and context and expect magically good tests to come out of that.
maplethorpe 14 hours ago [-]
Have you tried Claude 4.6 Opus? I think it might be able to do what you're suggesting.
raw_anon_1111 22 hours ago [-]
He’s saying you should write or at least have the LLM write the tests and you carefully review the tests and not the code.
skydhash 20 hours ago [-]
That’s like saying that to trace a spline, you only need to place a few points and carefully verify that the spline passes through those points, without ever verifying the actual formula of the spline.
Or in other words: tests only guarantee their own results, not the code. A test has value because you know the code is trying to solve the general problem, not just the test’s assertions.
raw_anon_1111 20 hours ago [-]
That’s a horrible analogy. He specifically said he was designing and validating the tests based on his knowledge of what the goal of the project was.
fcatalan 23 hours ago [-]
A couple weeks ago on a lark I asked Claude/Gemini/Codex to hallucinate a language they would like to program in and they always agreed on strong types, contracts, verification, proving and testing. So they ended up brainstorming a weird Forth-like with all those on top. I then kept prodding for an implementation and burned my weekly token budget until a lot of the language worked. They called it Cairn.
So now I prompted this: "can you generate a fizzbuzz implementation in Cairn that showcases as much as possible the TEST/PROVE/VERIFY characteristics of the language? "
Same originating idea: "a language for AI to write in" but then everything else is different.
The features of the two are quite orthogonal. Cairn is a general-purpose language with features that help in writing provably working code. Mog is more like "let's constrain our features so bad code can't do much, and trade that for good agent ergonomics".
Cairn is a crazy sprawling idea, Mog is a little attempt at something limited but practical.
Mog seems like something someone has thought about. No one has thought about Cairn; it's pure LLM hallucination. The fact that it exists and can do a lot of stuff is just the result of someone (me) not knowing when a joke has gone too far.
tedivm 1 day ago [-]
While I understand why people want to skip code reviews, I think it is an absolute mistake at this point in time. I think AI coding assistants are great, but I've seen them fail or go down the wrong path enough times (even with things like spec-driven development) that I don't think it's reasonable not to review code. Everything from development-only code paths left in production, to improper implementations, to security risks: all of those are just as likely to come from an AI as from a human, and any team that lets humans push to production without a review would absolutely be ridiculed for it.
Again, I'm not opposed to AI coding. I know a lot of people are. I have multiple open source projects that were 100% created with AI assistants, and wrote a blog post about it you can see in my post history. I'm not anti-ai, but I do think that developers have some responsibility for the code they create with those tools.
Lerc 1 day ago [-]
I agree that it would be a mistake to use something like this where people depend upon specific behaviour of the software. But the only way we will get to the point where we can do this is by building things that don't quite work and then fixing the problems. Like AI models themselves, which now succeed on problems they couldn't even begin to attempt a short time ago. Complaints about premature deployment lose track of the fact that we are still developing this technology. Premature deployment will always happen while people seek a first-mover advantage. People need to stay aware of that without criticising the field itself.
There is a subset of things for which it would be OK to do this right now: instances where the cost of utter failure is relatively low. For visual results the benchmark is often 'does it look right?' rather than 'is it strictly accurate?'
pron 1 day ago [-]
> The code must pass property-based tests
Who writes the tests? It can be ok to trust code that passes tests if you can trust the tests.
There are, however, other problems. I frequently see agents write code that's functionally correct but that they won't be able to evolve for long. That's also what happened with Anthropic's failed attempt to have agents write a C compiler (not a trivial task, but far from an exceptionally difficult one). They had thousands of good human-written tests, but the agents couldn't get the software to converge. They fixed one bug only to create another.
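pron's opening question — can you trust the tests? — is easier to answer for property-based tests, because the properties are a few lines to review instead of the whole implementation. A minimal sketch in plain Python (the toy sort stands in for agent-written code; no testing library assumed):

```python
import random
from collections import Counter

def agent_sort(xs):
    """Stand-in for the agent-written implementation under review."""
    return sorted(xs)

def check_sort_properties(trials=500, seed=42):
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 40))]
        ys = agent_sort(xs)
        # Property 1: the output is ordered.
        assert all(a <= b for a, b in zip(ys, ys[1:]))
        # Property 2: the output is a permutation of the input.
        assert Counter(ys) == Counter(xs)
    return True

assert check_sort_properties()
```

Reviewing the two property lines is the human's whole job here; the input generator and the implementation can both be machine-written.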
SAI_Peregrinus 19 hours ago [-]
If you don't trust agents not to "cheat" the tests then:
The agent that writes the tests must have read-only access to the spec & the API. It MUST NOT have access to the implementation, even to read it.
The agent that writes the implementation must have read-only access to the spec. It MUST NOT have access to the tests implementation, only to the output report from running them.
This is a PITA to manage with classic UNIX permissions, but is doable with ACLs (`setfacl`/`getfacl`). Actually getting the agent processes to run as different users in an IDE setting instead of a CLI is not supported out of the box by any of the major vendors AFAICT, so IMO they're not really fit-for-purpose.
jghn 1 day ago [-]
I do think that GenAI will lead to a rise in mutation testing, property testing, and fuzzing. But it's worth people keeping in mind that there are reasons why these aren't already ubiquitous. Among other issues, they can be computationally expensive, especially mutation testing.
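For anyone who hasn't met mutation testing, a minimal sketch shows both the mechanic and why it's computationally expensive: every kill-or-survive verdict costs a full test-suite run. (The `fizz` function and the single AST mutation operator are illustrative, not taken from any real tool.)

```python
import ast

# Hypothetical code under test, plus a tiny test suite below.
SRC = "def fizz(n):\n    return 'Fizz' if n % 3 == 0 else str(n)\n"

class ModToAdd(ast.NodeTransformer):
    """One classic mutation operator: replace % with +."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Mod):
            node.op = ast.Add()
        return node

def suite_passes(src):
    """Each call here is a full test-suite run — the expensive part."""
    ns = {}
    try:
        exec(src, ns)
        return ns["fizz"](3) == "Fizz" and ns["fizz"](4) == "4"
    except Exception:
        return False

mutant = ast.unparse(ast.fix_missing_locations(ModToAdd().visit(ast.parse(SRC))))
assert suite_passes(SRC)         # the original code passes
assert not suite_passes(mutant)  # the mutant is "killed": the suite noticed
```

A real tool generates many mutants per file, so the wall-clock cost is roughly (number of mutants) × (suite runtime) — which is jghn's point about why this isn't already ubiquitous.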
flir 22 hours ago [-]
Can't be more expensive than GenAI itself, can it?
duskdozer 1 day ago [-]
So are we finally past the stage where people pretend they're actually reading any of the code their LLMs are dumping out?
Gigachad 23 hours ago [-]
I don't believe the junior devs on my team even run the code they are generating, let alone read it. It feels like I'm doing 5x the work reviewing and testing compared to the person submitting.
icoder 22 hours ago [-]
And I bet the impact of your review work on their development into medior/senior goes towards 0.
I recognize this discrepancy where review effort becomes more than the coding itself. I don't think I could sustain that for long.
Gigachad 20 hours ago [-]
I have been looking for some way to reduce the burden on me and put it back on the developer submitting. So far I’ve been asking them to split their wall-of-code PRs into multiple smaller ones, and soon I’ll probably ask them to demo the feature to me working, because I can’t assume they did this themselves.
fhd2 1 day ago [-]
Who's "we"?
I'd consider shipping LLM generated code without review risky. Far riskier than shipping human-generated code without review.
But it's arguably faster in the short run. Also cheaper.
So we have a risk vs speed to market / near term cost situation. Or in other words, a risk vs gain situation.
If you want higher gains, you typically accept more risk. Technically it's a weird decision to ship something that might break, that you don't understand. But depending on the business making that decision, their situation and strategy, it can absolutely make sense.
How to balance revenue, costs and risks is pretty much what companies do. So that's how I think about this kind of stuff. Is it a stupid risk to take for questionable gains in most situations? I'd say so. But it's not my call, and I don't have all the information. I can imagine it making sense for some.
rsoto2 24 hours ago [-]
the industry is in full psychosis
empath75 1 day ago [-]
In a year people will be complaining about human written code going into production without LLM review.
sharkjacobs 1 day ago [-]
I'm having a hard time wrapping my head around how this can scale beyond trivial programs like simplified FizzBuzz.
hrmtst93837 1 day ago [-]
People treating this as a scaling problem are skipping the part where verification runs into undecidability fast.
Proving a small pure function is one thing, but once the code touches syscalls, a stateful network protocol, time, randomness, or messy I/O semantics, the work shifts from 'verify the program' to 'model the world well enough that the proof means anything,' and that is where the wheels come off.
SAI_Peregrinus 20 hours ago [-]
Or anything where the interaction of small pure functions matters. NAND is a simple pure function. 6 NANDs connected correctly gets you a D flip flop, and suddenly you've got state. Bugs can hide in the combinatorics of all the possible states of your system, and you'll never test them all in polynomial time.
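That state-from-pure-functions jump is easy to demonstrate. A sketch assuming nothing beyond the NAND truth table: a gated D latch (the flip-flop's simpler cousin) simulated in Python, where iterating the feedback loop is what makes a stored bit appear.

```python
def nand(a, b):
    """The only pure function we build from."""
    return 0 if (a and b) else 1

class DLatch:
    """Gated D latch from five NANDs: state emerges from the feedback loop."""
    def __init__(self):
        self.q, self.qn = 0, 1

    def step(self, d, enable):
        s = nand(d, enable)               # gating NANDs
        r = nand(nand(d, d), enable)      # nand(d, d) acts as NOT d
        for _ in range(4):                # let the cross-coupled pair settle
            self.q = nand(s, self.qn)
            self.qn = nand(r, self.q)
        return self.q

latch = DLatch()
assert latch.step(1, 1) == 1   # enable high: latch stores D = 1
assert latch.step(0, 0) == 1   # enable low: input ignored, bit held
assert latch.step(0, 1) == 0   # enable high again: stores D = 0
```

Each NAND is trivially verifiable in isolation; the bug surface only appears in the wiring, which is exactly the combinatorics-of-states problem.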
skydhash 19 hours ago [-]
People forget that most tests aren't written to verify that the code is correct or that the spec is being followed. They are mostly written to check that something isn't broken (mostly to serve as a canary when coding or deploying). There are just too many possible combinations. What you do is define broad categories and select a few candidates to run the code through. But what matters most is the theory of the software: having a good understanding of the problem's domain, the model of the solution, and the technical implementation of the solution.
An analogy I've been using is the formula of a curve like y - x^2 = 0 as the theory of the software. Test points could be (0, 0), (-3, 9), (5, 25). But there are a lot of curves that pass through those points too. The points' utility is not to prove that you used the correct formula; it's mostly to check that someone hasn't accidentally changed one of the components, like the exponent or the minus sign. What matters most for the developer is knowing why we're using this formula.
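The analogy runs as code. Here's a hypothetical "impostor" curve chosen so it passes through exactly the same three test points:

```python
POINTS = [(0, 0), (-3, 9), (5, 25)]

def intended(x):
    return x ** 2                            # the theory of the software

def impostor(x):
    # A cubic built to vanish at x = 0, -3, 5 on top of x**2, so it
    # agrees with intended() at every test point and nowhere else much.
    return x ** 2 + x * (x + 3) * (x - 5)

# Both implementations "pass CI" ...
assert all(intended(x) == y for x, y in POINTS)
assert all(impostor(x) == y for x, y in POINTS)

# ... yet they disagree almost everywhere else.
assert intended(1) == 1
assert impostor(1) == -15
```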
SAI_Peregrinus 4 hours ago [-]
Yep! I mostly make the point I did to show that "100% coverage" is an impossible metric for any app of even moderate complexity. Regression tests tend to have much more value than trying to preemptively find every bug and test for it.
anematode 14 hours ago [-]
I like this analogy, thanks!
agentultra 1 day ago [-]
This might work on small, self contained projects.
No side effects is a hefty constraint.
Systems tend to have multiple processes all using side effects. There are global properties of the system that need specification and tests are hard to write for these situations. Especially when they are temporal properties that you care about (eg: if we enter the A state then eventually we must enter the B state).
When such guarantees involve multiple processes, even property tests aren’t going to cover you sufficiently.
Worse, when it falls over at 3am and you’ve never read the code… is the plan to vibe-code a bug fix right there? Will you also remember to modify the specifications first?
Good on the author for trying. Correctness is hard.
keithnz 24 hours ago [-]
I've been working on a "vibe coded" project to create an open source TUI SQL query tool, a bit like DataGrip: autocomplete, syntax highlighting, schema introspection, vim/non-vim modes, an MCP mode so an agent can help with queries and get results, row editing, etc. It's mostly an experiment in how to build software from scratch via an agent without looking at the code (other than to see what decisions it's making), and I wanted something reasonably complicated so the requirements evolve and change over time.
There are a couple of issues I find: many bugs are unspecified edge cases, especially because many of the features "combo" together, and it's hard for the agent to maintain consistency across the UI. You end up setting up a lot more context for cross-cutting concerns, self-review, and testing. The tool itself is actually really useful and is now my main tool for querying our DBs. Most of the problems I find are due to "sloppy" prompting (or just not thinking through the edge cases) and a lack of project-wide guidance on the architecture of the system, to maintain consistency across the same concerns.
softwaredoug 23 hours ago [-]
When you write enough tests to verify AI code, you’re just making the tests the code and compiling an executable from tests
Which sucks because writing tests is the most tedious part of building software
phailhaus 1 day ago [-]
Using FizzBuzz as your proxy for "unreviewed code" is extremely misleading. It has practically no complexity, it's completely self-contained and easy to verify. In any codebase of even modest complexity, the challenge shifts from "does this produce the correct outputs" to "is this going to let me grow the way I need it to in the future" and thornier questions like "does this have the performance characteristics that I need".
loloquwowndueo 1 day ago [-]
> is this going to let me grow the way I need it to in the future
This doesn’t matter in the age of AI - when you get a new requirement just tell the AI to fulfill it and the old requirements (perhaps backed by a decent test suite?) and let it figure out the details, up to and including totally trashing the old implementation and creating an entirely new one from scratch that matches all the requirements.
For performance, give the AI a benchmark and let it figure it out as well. You can create teams of agents each coming up with an implementation and killing the ones that don’t make the cut.
Or so goes the gospel in the age of AI. I’m being totally sarcastic, I don’t believe in AI coding
Swizec 1 day ago [-]
> including totally trashing the old implementation and creating an entirely new one from scratch that matches all the requirements
Let me guess, you've never worked in a real production environment?
When your software supports 8, 9, 10 or more zeroes of revenue, "trash the old and create new" are just about the scariest words you can say. There's people relying on this code that you've never even heard of.
> Let me guess, you've never worked in a real production environment?
The comment to which you're responding includes a note at the end that the commenter is being sarcastic. Perhaps that wasn't in the comment when you responded to it.
Swizec 1 day ago [-]
It wasn’t thanks for highlighting. Can be hard to tell online because there’s a lot of people genuinely suggesting everyone should build their own software on the fly
rsoto2 24 hours ago [-]
If the amount of code corporations produce goes even 2x there's gonna be a lot of jobs for us to fix every company's JIRA implementation because the c-suite is full of morons.
person22 1 day ago [-]
I work on a product that meets your criteria. We can't fix a class of defects because once we ship, customers will depend upon that behavior and changing is very expensive and takes years to deprecate and age out. So we are stuck with what we ship and need to be very careful about what we release.
fhd2 1 day ago [-]
That's why I find any effort to create specifications... cute. In brownfield software, more often than not, the code _is_ the specification.
suzzer99 24 hours ago [-]
But if you start from the beginning with a code base that is always only generated from a spec, presumably as the tools improve you'd be able to grow to a big industrial-grade app that is 100% based on a spec.
The question is how many giant apps have yet to even be started vs. how many brownfield apps will outlive all of us.
fhd2 14 hours ago [-]
If the spec covers 100% of the code paths, then yes, you're right. But now spec and code are entirely redundant. Changing the spec or changing the code takes the same effort.
If the spec doesn't specify all the details, then there are gaps for the code to fill. For example, code for a UI is highly specific, down to the last pixel. A spec might say "a dialog with two buttons, labelled OK and cancel". That dialog would look different every time the spec is reimplemented.
Unless of course, there was also a spec for the dialog, that we could refer to in the other spec? That's really just code and reuse.
patates 1 day ago [-]
This might be the "Steve, Don't Eat It!" version of the xkcd workflow comic.
Whatever you ship, steve will eat, and some steves will develop an addiction.
empath75 1 day ago [-]
> When your software supports 8, 9, 10 or more zeroes of revenue, "trash the old and create new" are just about the scariest words you can say. There's people relying on this code that you've never even heard of.
Well, now it'll take them 5 minutes to rewrite their code to work around your change.
Swizec 1 day ago [-]
> Well, now it'll take them 5 minutes to rewrite their code to work around your change
You misunderstand. It will take them 2 years to retrain 5000 people on the new process across hundreds of locations. In some fields, whole new college-level certifications courses will have to be created.
In my specific experience it’s just a few dozen (maybe 100) people doing the manual process on top of our software and it takes weeks for everyone to get used to any significant change.
We still have people using pages that we deprecated a year ago. Nobody can figure out who they are or what they’re missing on the new pages we built
loloquwowndueo 21 hours ago [-]
> You misunderstand. It will take them 2 years to retrain 5000 people on the new process across hundreds of locations. In some fields, whole new college-level certifications courses will have to be created.
Replace them by AI.
I’m still being sarcastic.
dolmen 23 hours ago [-]
Ask the AI for a strategy, and for tools to build, to figure it out.
Swizec 23 hours ago [-]
Great, now you have a strategy (one less MBA to hire). You still need to execute the strategy.
The doing is where most of the time goes. Strategy docs are cheap, my intern can give you 5 of those by tomorrow.
procaryote 1 day ago [-]
That will be after it broke, which costs money
Also: no
lelanthran 1 day ago [-]
> Or so goes the gospel in the age of AI. I’m being totally sarcastic, I don’t believe in AI coding
You may think you are being sarcastic, but I guarantee that a significant percentage of developers think that both the following are true:
a) They will never need to write code again, and
b) They are some special snowflake that will still remain employed.
patates 1 day ago [-]
I don't agree with your first point. We are surely writing less code, and it will keep getting less and less. At some point it will reduce to a single run function that will make the universe and everything work and it will be called via a button, and that will be the modern definition of writing code: Click the button. Not a lot of keys with weird alphabet thingies on them.
You are however right on your second point because I'm damn good at clicking buttons.
baq 1 day ago [-]
it isn't gospel, it's perspective. if you care about the code, it's obviously bonkers. if you care about the product... code doesn't matter - it's just a means to an end. there's an intersection of both views in places where code actually is the product - the foundational building blocks of today's computing software infrastructure like kernels, low level libraries, cryptography, etc. - but your typical 'uber for cat pictures' saas business cares about none of this.
Alex_L_Wood 24 hours ago [-]
If you care about the product, you care twice as much about code correctness and alignment with the stakeholders' expectations.
hvb2 24 hours ago [-]
So you're an auto maker: you say you can care about your product but not about how it's built?
If you're building for the cheapest segment of the market, just maybe. Anything else is a hard no imho
baq 24 hours ago [-]
Yes? If you’re an auto factory, you might care, but an auto maker cares about minimizing cost and maximizing revenue within the regulatory constraints. Nowhere is there a requirement to care about how the car is built, there are requirements on what the car can and cannot do.
hvb2 13 hours ago [-]
Branding is a thing, you know. Especially if you want to sell the high margin cars.
slopinthebag 22 hours ago [-]
Caring about the requirements on what the car can and cannot do sounds suspiciously like caring about how it's built when you consider how it's built directly impacts what it can and cannot do.
baq 14 hours ago [-]
Not at all; it's what vs how, completely different beasts. The how is for engineers to solve, but the business sells the what.
wordpad 1 day ago [-]
> AI capability problem is mostly solved; the distribution and trust problem isn't.
SaaS opportunity? Maybe, some sort of marketplace of AI-written applications and services with discovery features?
vicchenai 22 hours ago [-]
the part that breaks down for me is the property test loop. if the agent writes the code AND the properties, it's just bootstrapping from the same mental model that produced the bug. i've had it pass all self-generated tests and still ship logic that was wrong in ways i only caught by accident. review the spec/properties carefully, not the code, seems like the right frame.
boombapoom 1 day ago [-]
production ready "fizz buzz" code. lol. I can't even continue typing this response.
artee_49 1 day ago [-]
Unintended side-effects are the biggest problems with AI generated code. I can't think of a proper way to solve that.
eggbrain 23 hours ago [-]
I find people over-rotate on whether we should be reviewing AI-produced code. "What if bad code gets into production!" some programmers gasp, as if they themselves have never pushed bad code, or had coworkers do the same.
I've worked at places where I've trusted everyone on my team to the extent that most PRs got only a quick glance before getting a "LGTM". On the flipside, I've also worked on teams where every person was a different kind of liability with the code that they pushed, and for those teams I implemented every linting / pre-commit / testing tool possible that all needed to pass inspection (including human review) before any code arrived on production.
A year ago, AI was like that latter team I mentioned -- something I had to check, double check, and correct until I was happy with what it produced. Over the past 6 months, it's gotten closer (but still fairly far away) from the former team I mentioned -- I have to correct it about 10% of the time, whereas for most things it gets it right.
The fact that AI produces a much _larger_ volume of code than the average engineer is perhaps slightly concerning, but I don't see it much differently than code at large companies. Does every Facebook engineer review every junior engineer's pull request to make sure bad code doesn't slip in?
That isn't to say I'm for letting AI go wild with code -- but I think if, at worst, we consider AI to be a junior engineer we need to rein in with static analysis tools / linters / testers etc, we will probably be able to mitigate a lot of the downside.
maplethorpe 14 hours ago [-]
At least when a human pushed bad code in the past, they could be held accountable.
mattdeboard 23 hours ago [-]
Do you not review junior developers' code? I don't understand your point
eggbrain 23 hours ago [-]
Your comment seems to imply AI is currently at a junior developer's level -- 12 months ago I would have agreed (like I mentioned in my parent comment, both near the end and about the "latter" team I was a part of), but it's gotten quite good over the past few months.
That's not to say it won't ship bugs, but so does any engineer (junior or senior). It's up to you as to what level of tooling you surround the AI with (automated testing / linting / etc), but at the very least it doesn't also hurt to have that set up anyways (automated tests have helped prevent senior devs from shipping bad code too).
mattdeboard 22 hours ago [-]
Ok but are you arguing against code reviews of AI generated code?
teiferer 22 hours ago [-]
If that would work reliably then you could apply that to human-produced code too. But nothing like that has shown to work, so I wouldn't put money on it working for LLM output.
otabdeveloper4 1 day ago [-]
This one is pretty easy!
Just write your business requirements in a clear, unambiguous and exhaustive manner using a formal specification language.
Bam, no coding required.
rsoto2 23 hours ago [-]
damn if only this language could be made to work with numbers we would really have something. Let's ask an LLM about it
vemv 20 hours ago [-]
What is a correct, bug-free program?
...It's one that does what a specific set of humans want. There's no other useful definition. One man's feature is another's bug.
It logically follows that there must be a human review step. How else would you know what the human wants, with sufficient detail?
Otherwise, there's an infinite number of undesired programs with passing test suites that AI can generate for you.
Ancalagon 1 day ago [-]
Even with mutation testing doesn’t this still require review of the testing code?
Animats 1 day ago [-]
Mutation is a test for the test suite. The question is whether a change to the program is detected by the tests. If it's not, the test suite lacks coverage.
That's a high standard for test suites, and requires heavy testing of the obvious.
But if you actually can specify what the program is supposed to do, this can work. It's appropriate where the task is hard to do but easy to specify. A file system or a database can be specified in terms of large arrays. Most of the complexity of a file system is in performance and reliability. What it's supposed to do from the API perspective isn't that complicated. The same can be said for garbage collectors, databases, and other complex systems that do something that's conceptually simple but hard to do right.
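"Specified in terms of large arrays" is the classic test-oracle pattern: keep a trivially correct model (here a plain dict) next to the system under test and drive both with random operations. A toy sketch with hypothetical names:

```python
import random

class TinyKV:
    """Hypothetical system under test — imagine a log-structured store."""
    def __init__(self):
        self._log = []                        # append-only (key, value) log

    def put(self, k, v):
        self._log.append((k, v))

    def get(self, k):
        for key, val in reversed(self._log):  # newest write wins
            if key == k:
                return val
        return None

def check_against_oracle(ops=2000, seed=0):
    rng = random.Random(seed)
    sut, oracle = TinyKV(), {}                # the spec is just a dict
    for _ in range(ops):
        k = rng.randrange(8)
        if rng.random() < 0.6:
            v = rng.randrange(100)
            sut.put(k, v)
            oracle[k] = v
        else:
            assert sut.get(k) == oracle.get(k)
    return True

assert check_against_oracle()
```

All the real-world hardness (caching, compaction, crash recovery, performance) lives in the implementation; the spec side stays a one-liner, which is what makes this class of system a good fit.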
Probably not going to help with a web page user interface. If you had a spec for what it was supposed to do, you'd have the design.
jryio 1 day ago [-]
Correct. Where did the engineering go? First it was in code files. Then it went to prompts, followed by context, and then agent harnesses. I think the engineering has gone into architecture and testing now.
We are simply shuffling cognitive and entropic complexity around and calling it intelligence. As you said, at the end of the day the engineer - like the pilot - is ultimately the responsible party at all stages of the journey.
Andrei_dev 23 hours ago [-]
The testing angle keeps coming up but it's sort of missing the point. I spent a few weeks poking through public repos built with AI tools — about 100 projects. 41% had secrets sitting raw in the source. Not in env files. In the code itself. Supabase service_role keys committed to GitHub, .env.example files with actual credentials, API keys hardcoded in client-side fetch calls.
No test catches any of that. Code works, tests pass, database is wide open.
It's not even a correctness problem. It's that the LLM never thought about rate limiting, CORS headers, CSRF tokens, a sane .gitignore — because nobody asked it to. Those are things devs add from muscle memory, from getting burned. The AI has no scars.
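Those particular leaks are also the kind a cheap mechanical pass can flag before commit. A toy Python sketch — the two patterns are illustrative only; real scanners such as gitleaks or trufflehog ship hundreds of rules plus entropy checks:

```python
import re

# Illustrative rules only, not a real ruleset.
RULES = {
    "jwt-style token (e.g. supabase service_role)": re.compile(
        r"eyJ[A-Za-z0-9_-]{20,}\.eyJ[A-Za-z0-9_-]{20,}"),
    "hardcoded api key/secret": re.compile(
        r"(?i)(api[_-]?key|secret)\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"),
}

def scan(text):
    """Return the names of every rule that fires on the given source text."""
    return [name for name, rx in RULES.items() if rx.search(text)]

leaky = 'fetch(url, { headers: { apikey: "sk_live_0123456789abcdef0123" } })'
assert scan(leaky) == ["hardcoded api key/secret"]
assert scan("const key = process.env.API_KEY") == []   # env lookup: clean
```

Wired into a pre-commit hook, even a crude pass like this would have caught most of the 41%.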
jerf 1 day ago [-]
"However, I'm starting to think that maintainability and readability aren't relevant in this context. We should treat the output like compiled code."
I would like to put my marker out here as vigorously disagreeing with this. I will quote my post [1] again, which given that this is the third time I've referred to a footnote via link rather suggests this should be lifted out of the footnote:
"It has been lost in AI money-grabbing frenzy but a few years ago we were talking a lot about AIs being “legible”, that they could explain their actions in human-comprehensible terms. “Running code we can examine” is the highest grade of legibility any AI system has produced to date. We should not give that away.
"We will, of course. The Number Must Go Up. We aren’t very good at this sort of thinking.
"But we shouldn’t."
Do not let go of human-readable code. Ask me 20 years ago whether we'd get "unreadable code generation" or "readable code generation" out of AIs and I would have guessed they'd generate completely opaque and unreadable code. Good news! I would have been completely wrong! They in fact produce perfectly readable code. It may be perfectly readable "slop" sometimes, but the slop-ness is a separate issue. Even the slop is still perfectly readable. Don't let go of it.
People are determined to make the future of code an even bigger dumpster fire than the present of code.
morpheos137 1 day ago [-]
I think we need to move toward provable code.
davemp 1 day ago [-]
So often these AI articles misrepresent or ignore the Test Oracle Problem. Generating correct tests is at least as hard as generating the correct answers (often harder).
I’m actually starting to get annoyed about how much material is getting spread around about software analysis / formal methods by folks ignorant about the basics of the field.
andai 1 days ago [-]
...in FizzBuzz
ventana 1 days ago [-]
I might be missing the point of the article, but from what I understand, the TL;DR is, "cover your code with tests", be it unit tests, functional tests, or mutants.
Each of these approaches is just fine and widely used, and none of them can be called "automated verification", which, if my understanding of the term is correct, is more about mathematical proof that the program works as expected.
The article mostly talks about automatic test generation.
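The "mutants" mentioned above refer to mutation testing, which flips the trust question around: instead of assuming the tests are good, you deliberately break the code and check that the suite notices. A minimal hand-rolled sketch (real tools such as mutmut or PIT generate and run the mutants automatically):

```python
# Hand-rolled mutation testing: break the code on purpose and check
# that the test suite "kills" the mutant by failing on it.

def add(a, b):
    return a + b

def add_mutant(a, b):
    return a - b  # mutation: '+' flipped to '-'

def suite_passes(fn):
    # A tiny test suite expressed as a predicate over the implementation.
    return fn(2, 3) == 5 and fn(0, 7) == 7

assert suite_passes(add)              # the original passes
assert not suite_passes(add_mutant)   # the mutant is killed: the suite has teeth
```

If a mutant survives (the suite still passes), that part of the code is effectively untested, regardless of coverage numbers.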
I wrote a short blog post about this phenomenon here if you're interested: https://www.stet.sh/blog/both-pass
also +1 on placing heavy emphasis on the plan. if you have a good plan, then the code becomes trivial. I have started doing a 70/30 or even 80/20 split of time spent on plan / time implementing & reviewing
It happens all the time, even when I only scan the code or simply run it and use it. It's uncanny how many such "smells" I find even with the most trivial applications. Sometimes its replies in Codex or Claude Code are enough to trigger it.
These are mistakes only a very (very) inexperienced developer would make.
Or in other words: tests only guarantee their own results, not the code. A test has value because you know the code is trying to solve the general problem, not just to satisfy the test's assertions.
So now I prompted this: "can you generate a fizzbuzz implementation in Cairn that showcases as much as possible the TEST/PROVE/VERIFY characteristics of the language? "
Producing this (working) monstrosity (can't paste here, it's 200+ lines of crazy): https://gist.github.com/cairnlang/a7589de126b14e50a53b9bdc28...
https://news.ycombinator.com/item?id=47312728
The features of both are quite orthogonal. Cairn is a general-purpose language with features that help in writing provably working code. Mog is more like "let's constrain our features so bad code can't do much, and trade that for good agent ergonomics".
Cairn is a crazy sprawling idea, Mog is a little attempt at something limited but practical.
Mog seems like something someone has thought about. No one has thought about Cairn; it's pure LLM hallucination, and the fact that it exists and can do a lot of stuff is just the result of someone (me) not knowing when a joke has gone too far.
Again, I'm not opposed to AI coding. I know a lot of people are. I have multiple open source projects that were 100% created with AI assistants, and wrote a blog post about it you can see in my post history. I'm not anti-ai, but I do think that developers have some responsibility for the code they create with those tools.
There is a subset of things for which it would be OK to do this right now: instances where the cost of utter failure is relatively low. For visual results the benchmark is often "does it look right?" rather than "is it strictly accurate?"
Who writes the tests? It can be ok to trust code that passes tests if you can trust the tests.
There are, however, other problems. I frequently see agents write code that's functionally correct but that they won't be able to evolve for long. That's also what happened with Anthropic's failed attempt to have agents write a C compiler (not a trivial task, but far from an exceptionally difficult one). They had thousands of good human-written tests, but the agents couldn't get the software to converge. They fixed one bug only to create another.
The agent that writes the tests must have read-only access to the spec & the API. It MUST NOT have access to the implementation, even to read it.
The agent that writes the implementation must have read-only access to the spec. It MUST NOT have access to the tests implementation, only to the output report from running them.
This is a PITA to manage with classic UNIX permissions, but is doable with ACLs (`setfacl`/`getfacl`). Actually getting the agent processes to run as different users in an IDE setting instead of a CLI is not supported out of the box by any of the major vendors AFAICT, so IMO they're not really fit-for-purpose.
I recognize this discrepancy where review effort becomes more than the coding itself. I don't think I could sustain that for long.
I'd consider shipping LLM generated code without review risky. Far riskier than shipping human-generated code without review.
But it's arguably faster in the short run. Also cheaper.
So we have a risk vs speed to market / near term cost situation. Or in other words, a risk vs gain situation.
If you want higher gains, you typically accept more risk. Technically it's a weird decision to ship something that might break, that you don't understand. But depending on the business making that decision, their situation and strategy, it can absolutely make sense.
How to balance revenue, costs and risks is pretty much what companies do. So that's how I think about this kind of stuff. Is it a stupid risk to take for questionable gains in most situations? I'd say so. But it's not my call, and I don't have all the information. I can imagine it making sense for some.
Proving a small pure function is one thing, but once the code touches syscalls, a stateful network protocol, time, randomness, or messy I/O semantics, the work shifts from 'verify the program' to 'model the world well enough that the proof means anything,' and that is where the wheels come off.
An analogy I've been using is the formula of a curve, like y - x^2 = 0, as the theory of the software. Test points could be (0, 0), (-3, 9), (5, 25). But there are a lot of other curves that pass through these points too. The points' utility is not to prove that you used the correct formula; it's mostly to check that someone hasn't accidentally changed one of the components, like the exponent or the minus sign. What matters most for the developer is knowing why we're using this formula.
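The curve analogy can be made concrete. Below, `f_lookalike` is a cubic chosen (purely for illustration) to pass through exactly the same three test points as x², while being a different function everywhere else:

```python
def f_intended(x):
    # The "theory" of the software: y = x^2
    return x ** 2

def f_lookalike(x):
    # A cubic that also passes through (0,0), (-3,9), (5,25),
    # because the added term vanishes at x = 0, -3, and 5.
    return x ** 2 + x * (x + 3) * (x - 5)

test_points = [(0, 0), (-3, 9), (5, 25)]
for x, y in test_points:
    assert f_intended(x) == y
    assert f_lookalike(x) == y   # the wrong theory passes the same suite

# Yet the two disagree almost everywhere else:
assert f_intended(1) != f_lookalike(1)
```

Finitely many test points can never pin down the formula; they only detect accidental perturbations of the one you already believe in.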
No side effects is a hefty constraint.
Systems tend to have multiple processes all using side effects. There are global properties of the system that need specification and tests are hard to write for these situations. Especially when they are temporal properties that you care about (eg: if we enter the A state then eventually we must enter the B state).
When such guarantees involve multiple processes, even property tests aren’t going to cover you sufficiently.
Worse, when it falls over at 3am and you’ve never read the code… is the plan to vibe-code a bug fix right there? Will you also remember to modify the specifications first?
Good on the author for trying. Correctness is hard.
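For finite recorded traces, a temporal property like "if we enter state A then eventually we must enter state B" can at least be checked mechanically. The sketch below is a naive single-trace checker, not a model checker like TLC, which would explore every reachable interleaving rather than just the runs you happened to record:

```python
def eventually(trace, trigger, goal):
    """Naive finite-trace check: every occurrence of `trigger`
    must eventually be followed by `goal` within the trace.

    This only judges the recorded trace; it says nothing about
    interleavings that were never observed.
    """
    pending = False
    for state in trace:
        if state == trigger:
            pending = True
        elif state == goal:
            pending = False
    return not pending

assert eventually(["idle", "A", "working", "B", "idle"], "A", "B")
assert not eventually(["idle", "A", "working"], "A", "B")
```

The gap between "holds on every trace we logged" and "holds on every possible execution" is exactly why multi-process temporal properties outrun ordinary testing.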
https://softwaredoug.com/blog/2026/03/10/the-tests-are-the-c...
This doesn’t matter in the age of AI - when you get a new requirement just tell the AI to fulfill it and the old requirements (perhaps backed by a decent test suite?) and let it figure out the details, up to and including totally trashing the old implementation and creating an entirely new one from scratch that matches all the requirements.
For performance, give the AI a benchmark and let it figure it out as well. You can create teams of agents each coming up with an implementation and killing the ones that don’t make the cut.
Or so goes the gospel in the age of AI. I’m being totally sarcastic, I don’t believe in AI coding
Let me guess, you've never worked in a real production environment?
When your software supports 8, 9, 10 or more zeroes of revenue, "trash the old and create new" are just about the scariest words you can say. There's people relying on this code that you've never even heard of.
Really good post about why AI is a poor fit in software environments where nobody even knows the full requirements: https://www.linkedin.com/pulse/production-telemetry-spec-sur...
The comment to which you're responding includes a note at the end that the commenter is being sarcastic. Perhaps that wasn't in the comment when you responded to it.
The question is how many giant apps out there have yet to even be started vs. how many brownfield apps out there will outlive all of us.
If the spec doesn't specify all the details, then there are gaps for the code to fill. For example, code for a UI is highly specific, down to the last pixel. A spec might say "a dialog with two buttons, labelled OK and cancel". That dialog would look different every time the spec is reimplemented.
Unless of course, there was also a spec for the dialog, that we could refer to in the other spec? That's really just code and reuse.
Whatever you ship, steve will eat, and some steves will develop an addiction.
Well, now it'll take them 5 minutes to rewrite their code to work around your change.
You misunderstand. It will take them 2 years to retrain 5000 people on the new process across hundreds of locations. In some fields, whole new college-level certifications courses will have to be created.
In my specific experience it’s just a few dozen (maybe 100) people doing the manual process on top of our software and it takes weeks for everyone to get used to any significant change.
We still have people using pages that we deprecated a year ago. Nobody can figure out who they are or what they’re missing on the new pages we built
Replace them by AI.
I’m still being sarcastic.
The doing is where most of the time goes. Strategy docs are cheap, my intern can give you 5 of those by tomorrow.
Also: no
You may think you are being sarcastic, but I guarantee that a significant percentage of developers think that both the following are true:
a) They will never need to write code again, and
b) They are some special snowflake that will still remain employed.
You are however right on your second point because I'm damn good at clicking buttons.
If you're building for the cheapest segment of the market, just maybe. Anything else is a hard no imho
SaaS opportunity? Maybe, some sort of marketplace of AI-written applications and services with discovery features?
I've worked at places where I've trusted everyone on my team to the extent that most PRs got only a quick glance before getting a "LGTM". On the flipside, I've also worked on teams where every person was a different kind of liability with the code that they pushed, and for those teams I implemented every linting / pre-commit / testing tool possible that all needed to pass inspection (including human review) before any code arrived on production.
A year ago, AI was like that latter team I mentioned -- something I had to check, double check, and correct until I was happy with what it produced. Over the past 6 months, it's gotten closer to (but is still fairly far from) the former team I mentioned -- I have to correct it about 10% of the time, whereas for most things it gets it right.
The fact that AI produces a much _larger_ volume of code than the average engineer is perhaps slightly concerning, but I don't see it much differently than code at large companies. Does every Facebook engineer review every junior engineer's pull request to make sure bad code doesn't slip in?
That isn't to say I'm for letting AI go wild with code -- but I think if at worst we consider AI to be a junior engineer we need to rein in with static analysis tools / linters / testers etc, we will probably be able to mitigate a lot of the downside.
When even Linus Torvalds compliments AI code (ref: https://www.reddit.com/media?url=https%3A%2F%2Fi.redd.it%2Fa...) I think we can say he wouldn't have said that about any junior engineer.
That's not to say it won't ship bugs, but so does any engineer (junior or senior). It's up to you as to what level of tooling you surround the AI with (automated testing / linting / etc), but at the very least it doesn't also hurt to have that set up anyways (automated tests have helped prevent senior devs from shipping bad code too).
Just write your business requirements in a clear, unambiguous and exhaustive manner using a formal specification language.
Bam, no coding required.
...It's one that does what a specific set of humans want. There's no other useful definition. One man's feature is another's bug.
It logically follows that there must be a human review step. How else would you know what the human wants, with sufficient detail?
Otherwise, there's an infinite number of undesired programs with passing test suites that AI can generate for you.
But if you actually can specify what the program is supposed to do, this can work. It's appropriate where the task is hard to do but easy to specify. A file system or a database can be specified in terms of large arrays. Most of the complexity of a file system is in performance and reliability. What it's supposed to do from the API perspective isn't that complicated. The same can be said for garbage collectors, databases, and other complex systems that do something that's conceptually simple but hard to do right.
Probably not going to help with a web page user interface. If you had a spec for what it was supposed to do, you'd have the design.
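That "specify it as large arrays" idea is essentially model-based testing: run the complicated implementation and a trivially correct model side by side on random operations and compare. A toy sketch, with a hypothetical `TinyStore` standing in for the file system or database and a plain dict as the spec:

```python
import random

class TinyStore:
    """Stand-in for the system under test (file system, DB, etc.).

    In reality this would be the complex implementation whose
    performance and reliability tricks we want to verify.
    """
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)
    def delete(self, key):
        self._data.pop(key, None)

def run_against_model(steps=500, seed=42):
    # The "spec" is just a dict: conceptually simple, like the
    # large-array view of a file system's API.
    rng = random.Random(seed)
    store, model = TinyStore(), {}
    for _ in range(steps):
        key = rng.choice("abcd")
        op = rng.choice(["put", "get", "delete"])
        if op == "put":
            value = rng.randint(0, 99)
            store.put(key, value)
            model[key] = value
        elif op == "delete":
            store.delete(key)
            model.pop(key, None)
        # The oracle: the implementation must agree with the model.
        assert store.get(key) == model.get(key)

run_against_model()
```

Tools like Hypothesis's stateful testing automate exactly this pattern; the hard part, as the comment says, is that the API-level spec is simple while the implementation is not.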
We are simply shuffling cognitive and entropic complexity around and calling it intelligence. As you said, at the end of the day the engineer - like the pilot - is ultimately the responsible party at all stages of the journey.
No test catches any of that. Code works, tests pass, database is wide open.
[1]: https://jerf.org/iri/post/2026/what_value_code_in_ai_era/