NHacker Next
ProgramBench: Can language models rebuild programs from scratch? (arxiv.org)
weinzierl 1 days ago [-]
"Models favor monolithic, single-file implementations that diverge sharply from human-written code."

You don't say! I might have been just an LLM all along without even knowing it, since I too prefer single-file implementations.

Back in the old VB5/VB6 days Visual Studio had this mode where it showed the different functions in a file almost as if they were separate files. You could not scroll beyond the function's end, but you could easily switch between that mode and the whole-file view. I always found that a nice way of working (but admittedly the world was a lot simpler back then).

Also, my preference for fewer but longer files only holds when I write the code myself. When working with AI I think smaller files are beneficial for a quicker turnaround between human and machine.

tmtvl 18 hours ago [-]
How often has there been a HN submission for a project 'in a single C header file'?
kibwen 16 hours ago [-]
This has less to do with natural opinions regarding code organization and more to do with the fact that including, modularizing, and distributing C code has historically been a pain in the ass which is ameliorated by shoving everything into a single file.
weinzierl 16 hours ago [-]
EDIT: Sorry, I missed the "header" part (and the irony).

At least once, here you go:

https://news.ycombinator.com/item?id=48053570

Ok, I just submitted it myself, but I could not believe it had never been submitted before. It is from 1997 and was pretty popular for some time. I think it was even built into Google Picasa for a while.

rullopat 23 hours ago [-]
I think it's one reason (but not the only one) that LLMs work very well with Ruby on Rails.
bmn__ 24 hours ago [-]
This VB feature existed to accommodate programmers coming from the DOS-based QB IDE, who were used to its one-function-per-screen view. To my sensibilities, it does not make much sense after the advent of high-resolution desktop environments.
jongjong 21 hours ago [-]
This has been my preference as well. I build everything in one file until it becomes uncomfortable, and only then do I start breaking it up into multiple files... But even then, I try to keep the main business logic fully visible in the main file.
killerstorm 22 hours ago [-]
It's very misleading: they don't provide any meaningful documentation/requirements, just an executable black box.

E.g. the doc for ffmpeg, which I checked by downloading the docker image they provide to the model, is a README which basically just says "this is ffmpeg and docs can be found online". They do not allow models to go online.

So a model is supposed to reverse-engineer a black box using only a limited number of tries. I'm not sure even an ASI could do this under these constraints (without memorizing the ffmpeg code base, obviously).

In their posts, one of the authors mentions "usage docs". Obviously they had a command-line tool like `grep` in mind, where a man page sort of specifies program behavior. But then they added sqlite, ffmpeg, php, etc., where a usage doc is like one millionth of the information you need to implement ffmpeg.

And, of course, there's no human baseline. I'd guess making such a baseline would cost billions of dollars.

r123anH 21 hours ago [-]
Ahem. Bitkeeper and Samba were reverse engineered just from the protocol by humans. For free.
killerstorm 21 hours ago [-]
SMB is a rather basic RPC protocol with just a handful of different types of calls. It's many orders of magnitude simpler than audio/video compression formats.
groby_b 18 hours ago [-]
ffmpeg docs include none of the protocols/file formats.
thomashop 22 hours ago [-]
i thought the agent can execute real ffmpeg to compare
killerstorm 22 hours ago [-]
I think you underestimate the complexity of audio & video encoding standards. There are hundreds and hundreds of pages of specification. How many times would you need to execute the real ffmpeg to get all the tiny details?

It's certainly possible to reverse-engineer it from black-box access, but it would take *years*, and this test has a time limit.

astrange 16 hours ago [-]
ffmpeg also includes many formats with no standards that were reverse-engineered in the first place.
GorbachevyChase 19 hours ago [-]
Even given that, I think solving the problem would require a certain amount of personal agency and volition to drive useful experimentation. And then you still have an inescapable problem: a design process is never verifiably done; it's just a sense of taste that says a product is good enough and it's time to stop working on it.

I'm not sure this benchmark is even very interesting, because it requires a language model to do something that it really cannot do. Maybe it would be possible with a novel harness in an ensemble system, but I would never expect a pure language model run in a minimal harness to ever be able to do this.

tadamcz 1 days ago [-]
Nice work once again from Ofir Press and team; this seems to be an idea that's in the air.

> Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task

Fwiw, this is very different from what we find in MirrorCode:

> Opus 4.6 successfully reimplements almost every program up to gotree’s size in our benchmark.

https://epoch.ai/blog/mirrorcode-preliminary-results

I don't have time right now to dig in to what could explain the difference (I'm working hard on getting the full MirrorCode out as soon as possible). But I suspect that the ProgramBench authors are either under-eliciting the AIs, or their tasks are unfair/impossible given the constraints, or both.

I hope to look more into it after releasing MirrorCode, and write up my conclusions.

VladVladikoff 1 days ago [-]
I would love to try this out. I have a horrible legacy project that is written in Angular by a really amateur developer, full of huge blocks of copy-pasted code with minor modifications in each block. I've tried before to get an LLM to rewrite it into something more sensible, but I have not succeeded; usually it just ends up breaking everything. Is there a guide or some system to follow? What's the best way to accomplish a task like this?
rurban 6 hours ago [-]
Normal engineering practices, as taught since the '70s.

Break the problems up into manageable pieces. Make a plan, have tests to verify the outcome, implement that part. Rinse and repeat. Have integration tests.

jaggederest 15 hours ago [-]
I think one way is to take the existing system in something like a docker container or equivalent, some kind of black box, and write tests against it in pure HTTP calls or using browser automation to record (can drive it with AI). When you've reached a truly massive test suite that covers everything, you delete the container and use the test suite as an oracle for writing a new version (open book, the AI can look at the test suite but not change it).

This is a tactic based on things I have read in "Working Effectively with Legacy Code" by Michael Feathers - he discusses using cut points to build a testing firewall to bring code under test, then gradually expanding the test suite from that beachhead of confirmed interface.
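A minimal sketch of that record-then-replay idea in Python (the command names and JSON golden-file format here are illustrative placeholders, not anything from Feathers' book): phase one replays recorded inputs against the still-running legacy system and saves its outputs; phase two, after the container is deleted, treats the saved file as the only oracle.

```python
import json
import subprocess

def record_golden(cmd, cases, path):
    """Phase 1: while the legacy system still runs, replay each recorded
    input through it and persist (exit code, stdout, stderr) as the oracle."""
    golden = {}
    for stdin_data in cases:
        proc = subprocess.run(cmd, input=stdin_data,
                              capture_output=True, text=True)
        golden[stdin_data] = [proc.returncode, proc.stdout, proc.stderr]
    with open(path, "w") as f:
        json.dump(golden, f)

def check_against_golden(cmd, path):
    """Phase 2: the container is gone; the golden file alone decides."""
    with open(path) as f:
        golden = json.load(f)
    failures = []
    for stdin_data, expected in golden.items():
        proc = subprocess.run(cmd, input=stdin_data,
                              capture_output=True, text=True)
        if [proc.returncode, proc.stdout, proc.stderr] != expected:
            failures.append(stdin_data)
    return failures
```

The same pattern works with HTTP request/response pairs instead of stdin/stdout; the key design choice is that the golden file is append-only for the rewrite team, exactly the "open book, can look but not change" rule above.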

astrange 16 hours ago [-]
I've been very successful so far using Sonnet 4.6 (1M) as the basic model in Claude Code, plus the Codex and gemini-review plugins for second/third opinions. (The last one is somewhat busted and hardcodes old Gemini versions; I should patch it up.)

I needed to use Opus 4.7 for one project because it used very recent APIs, and it certainly is smart but it's also very expensive.

phpnode 23 hours ago [-]
I have an approach that can handle this, if you're interested. My email is in my profile.
stingraycharles 1 days ago [-]
The problem with these types of benchmarks is that it's 100% certain the LLM has been trained on all that code already, so they're all tainted: you don't know whether you're benchmarking recall or actual reasoning.

Same with SWE-bench and others.

tadamcz 21 hours ago [-]
I agree it's a potentially big problem, affecting almost any benchmark out there. We discuss it briefly in "Appendix A: Contamination and memorization" https://epoch.ai/blog/mirrorcode-preliminary-results#appendi....

Ideally one would do these benchmarks with held-out proprietary software, but that comes with many practical concerns.

Zigurd 22 hours ago [-]
That's a feature not a bug. It doesn't make benchmarking any more meaningful or simple, but being trained to recall patterns is a legitimate goal for a coding agent.
stingraycharles 10 hours ago [-]
Yes but then the benchmarks need to be presented as "this verifies whether the model can recall this exact same situation and does not actually benchmark any reasoning at all".

This is not the case, they're being presented as "how good is the model at software engineering". E.g. the benchmark in question says this:

"Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically."

When your benchmark is embedded this thoroughly in the training data, you're actually just benchmarking "how well do you remember what sqlite looks like" rather than "do you understand all the tradeoffs, risks, and design decisions that need to be made to build a bespoke database from scratch".

This is a VERY big caveat that, to me, for a decent part explains the discrepancy between benchmarks and reality.

grego 24 hours ago [-]
Is anyone familiar with gotree? It was mentioned as the most complex piece of code, but the metric was LOC. Based on the high-level description, gotree might be closer to a set of small programs/algorithms.

Interesting anyway. It will be nice to see these comparisons with open-weight models and how those fare.

tadamcz 23 hours ago [-]
There's a more detailed description in "Appendix B: Qualitative discussion of the gotree task"

https://epoch.ai/blog/mirrorcode-preliminary-results#appendi...

tadamcz 23 hours ago [-]
I should say one big difference is that ProgramBench has 200 target programs while MirrorCode has about 30. We did many manual things to ensure task quality that would have required huge resources at ProgramBench's scale.
jerf 22 hours ago [-]
"But I suspect that the ProgramBench authors are either under-eliciting the AIs, or their tasks are unfair/impossible given the constraints, or both."

I'd go with "impossible":

"Given a gold (reference) executable and its usage documentation, a task worker is asked to write source code and a build script that constructs a candidate executable which should reproduce the behavior of the gold executable."

The test cases are built from an AI examining the source code and producing test cases, and later text also confirms that during the production phase the AI can't read the original executable, so it can't reverse-engineer it directly. The test cases are therefore drawn from a situation where the tester has vastly more knowledge of the program than the implementer.

That is a losing scenario for anyone, be they human, modern AI, or even some hypothetical perfect programmer. Take ffmpeg as an extreme example. The documentation does not even remotely specify the program. Entire codecs can be missed at a stroke, and each of those codecs is itself a rich set of features that may or may not be used in a given input or output file, but the final tests can freely draw from any of those things. And trying to implement a codec from just some input and output would strain anyone, especially when the input is all but certain to not be sufficiently broad to make the determination for sure.

That sort of issue extends all the way down to even some tiny command-line programs I've written myself. The end-user documentation is never a specification. That's not what end-user documentation is. And even if you did hand the AI all relevant specifications you'd still get an implementation of the specification, but anyone who has ever implemented a non-trivial specification into real-world situations can tell you all about how even the spec is never enough.

I think that's an absolutely ridiculous test. If you handed it to me as a human, I would simply refuse, because I'd tell you straight up front that it is plainly obvious I'm going to utterly and completely fail, so why even bother spending the time to try?

LeCompteSftware 1 days ago [-]
Surely the biggest difference is that you guys are mostly testing LLMs on simpler utilities, mostly involving higher-level languages, whereas ProgramBench's targets are all very complex C programs (and much older programs with much more comprehensive test cases).

E.g. cal is totally routine. I would expect most sophomores to be able to write a perfectly good cal. In fact, the only program you tested which has anywhere close to the complexity of SQLite or FFmpeg is Pkl, and it looks like Opus 4.6 totally failed.

I think your results are consistent. You're just measuring different things. Your benchmark mostly tests LLMs' ability to write technically routine programs of moderate length; yes, the bioinformatics package involves specialized domain knowledge, but not specialized Go engineering. ProgramBench is harder.

tadamcz 1 days ago [-]
I don't think so. ProgramBench authors say no LLMs fully resolve any task, i.e. even the easiest tasks in their benchmark are unsolved. Whereas we found Opus 4.6 successfully reimplements almost every program up to gotree’s size (around 15-20 of them).

For Pkl, the preliminary results only went up to 1bn total tokens (costing $550, which would be cheap if LLMs could do the task). It might very well be solved at higher token budgets; see the report for more discussion of this.

The preliminary results are just on 4 targets. We have several Pkl-level and harder tasks in the full set which we're releasing soon.

In the following quote multiple things are not quite right:

> mostly involving higher-level languages, whereas ProgramBench are all very complex C programs (and much older programs with much more comprehensive test cases).

First, as I said above I think you're confusing the top-end of ProgramBench difficulty with the average. The quote in the OP is pretty clear that FFmpeg, SQLite, and PHP are the 3 hardest out of 200 in ProgramBench, and the bottom end is "compact CLI tools".

Second, I don't see the relevance of C vs higher-level languages, how does this make ProgramBench harder?

Third, for the test cases, I think you might be labouring under a misapprehension about how MirrorCode works? MirrorCode uses end-to-end tests from a variety of sources (the original program’s test suites, real-world data, and LLM-assisted generation). End-to-end means the stdout/stderr has to match exactly for each test case.
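As a rough illustration of what "match exactly" implies (this is my own sketch, not the MirrorCode harness), grading end-to-end means a single stray space or missing trailing newline in stdout fails the whole case:

```python
import subprocess

def run(cmd, stdin_data=""):
    """Capture exactly what the program emits on stdout and stderr."""
    proc = subprocess.run(cmd, input=stdin_data,
                          capture_output=True, text=True)
    return proc.stdout, proc.stderr

def score(reference, candidate, cases):
    """Fraction of cases where stdout AND stderr match byte-for-byte."""
    passed = sum(run(reference, c) == run(candidate, c) for c in cases)
    return passed / len(cases)
```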

tadamcz 24 hours ago [-]
> Eg cal is totally routine. I would expect most sophomores to be able to write a perfectly good cal.

This is incidental to the main disagreement, but btw I also doubt this.

Let's try to make the claim more precise. E.g. are you saying the average university undergraduate studying CS would reimplement cal from scratch (only stdlib), matching the output perfectly for all 1365 MirrorCode test cases, in (say) 3 days of full-time work (without AI assistance, obviously)? I'd bet against it!

Here is the manual for the cal that we use: https://media.githubusercontent.com/media/epoch-research/Mir...

You can also look at a full transcript of an LLM solving the task: https://epochai-public-eval-logs-manual.s3.amazonaws.com/eva...

The data is here: https://github.com/epoch-research/MirrorCode-data/
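To get a feel for why exact-match cal is harder than it looks, here is a toy Sunday-first month grid in Python (my own sketch, not the benchmark's reference). It handles the easy case, but a real cal additionally covers locale, the -3 and -y layouts, Julian day numbering, highlighting of today, and (depending on the implementation) the September 1752 switchover, and every one of those details has to match byte-for-byte.

```python
import calendar

def month_grid(year, month):
    """Render a month roughly like Unix cal: Sunday-first, 20 columns wide."""
    cal = calendar.Calendar(firstweekday=calendar.SUNDAY)
    title = f"{calendar.month_name[month]} {year}".center(20).rstrip()
    out = [title, "Su Mo Tu We Th Fr Sa"]
    week = []
    for day in cal.itermonthdays(year, month):
        # itermonthdays yields 0 for padding days outside the month
        week.append(f"{day:2}" if day else "  ")
        if len(week) == 7:
            out.append(" ".join(week).rstrip())
            week = []
    return "\n".join(out)
```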

LeCompteSftware 22 hours ago [-]
I didn't say "3 days of full-time work," that is totally unreasonable. I was giving them basically unlimited time to do whatever slow testing and research they needed. And let me qualify my statement: when I say "I would expect most sophomores to be able to do this," I mean "if most sophomores can't do this then their university is badly failing them." (If you want to split hairs about modern undergrads not learning C then I think this conversation is over.)

Of course it would take them a while to learn facts about datetime that the LLM doesn't need to learn. If your argument is about cost optimization then congrats, you win. The point is that it doesn't take a huge amount of C expertise to do this successfully - the standard implementation is nothing you wouldn't see in K&R: https://raw.githubusercontent.com/util-linux/util-linux/refs... It's routine.

But a nontrivial database, even a simple one like SQLite, really does require professional-level C expertise. It is not routine. So your comparison to ProgramBench still seems apples-to-oranges.

tadamcz 22 hours ago [-]
I think we're talking past each other here...
_pdp_ 1 days ago [-]
I am not surprised but this one sticks out...

> Models favor monolithic, single-file implementations that diverge sharply from human-written code.

Well, all of our code is monolithic, with some files close to 20K lines of code, and we do use coding agents (not for the original code, but as of late). I've always had a hunch that splitting everything into tiny files does not improve AI coding agent performance, although that feels counterintuitive given model context constraints.

To me, the important parts of a program should be clustered together so the implementation is obvious. Scattering the implementation across files all over the source tree does not help much in building the mental model.

That also closely matches how software used to be written.

Garlef 1 days ago [-]
> Scattering the implementation in various files all over the source tree

If you treat the source tree seriously, you can communicate a lot with how it is structured

_pdp_ 1 days ago [-]
Well, you can communicate organisational structure but not logic or intent. The directory is a tree; the code is a graph.

You can communicate some information by looking at the org chart of a company but it does not really tell you much how it works.

Arguably a coding agent is less concerned about where the files are than about the code itself.

huflungdung 1 days ago [-]
[dead]
BurningPenguin 1 days ago [-]
Kinda surprising to me, since I had some trouble with Cursor & Co. once a file went over ~800 lines. It repeatedly failed to edit it until I split it up into multiple logical components. As it should have been from the beginning...

Though, it was some time ago, so things might have improved?

_pdp_ 1 days ago [-]
In VS Code, basically any model can edit the 20K-line file without any issues. The coding harness does not read the entire file at once, though. It reads chunks of it, so the size does not really matter. What matters is how close together the things the agent needs to make the edit are.
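The chunked-reading pattern is easy to sketch (a guess at the general approach, not VS Code's actual implementation): grep for an anchor string and hand the model only a window of surrounding lines, so file size stops mattering and proximity of the relevant code is what counts.

```python
def read_window(path, anchor, context=40):
    """Return (1-based line number, surrounding text) for the first line
    containing `anchor`, or (None, "") if it is absent."""
    with open(path) as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if anchor in line:
            start = max(0, i - context)
            # Only ~2*context lines reach the model, however big the file is.
            return i + 1, "".join(lines[start:i + context])
    return None, ""
```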
tnelsond4 1 days ago [-]
Yeah, that was my experience with Grok: whenever I gave it a file with over 400 lines, it would just fail to comprehend it or be too lazy to write that much at a time. Splitting stuff up into separate files helped.
librasteve 1 days ago [-]
this is a big frustration for web code, what with HTML, CSS, JS, and PHP all spread about

https://htmx.org/essays/locality-of-behaviour/ is a good counterpoint, as exemplified in many stacks, e.g. https://harcstack.org

doix 1 days ago [-]
> Scattering the implementation in various files all over the source tree does not help much building the mental model.

Yeah, that happens where I work and I hate it. A combination of lint rules and AI reviewer prompts complain about long files and long functions. This means something that could be a 300-line self-contained function, readable linearly, gets split up into 6 functions across 6 files.

It's the illusion of "clean code". If you're casually skimming the code, you feel good. But as soon as you go beyond the surface level it becomes annoying.

logicchains 1 days ago [-]
> Models favor monolithic, single-file implementations that diverge sharply from human-written code.

This isn't the case if models are prompted to actually plan the file architecture beforehand, it's only the case if they're given a dumb monolithic "code this thing" prompt.

adrian_b 1 days ago [-]
> Open internet with cheating detection => cheating is widespread, 20-36% of tasks are flagged for the stronger models, with source code lookup accounting for the majority of the violations.

Therefore:

> blocking internet access entirely is the appropriate default for ProgramBench

The fact that your Anthropic coding assistant has a tendency to search the Internet for code to insert into your program may count as an additional copyright violation (besides the possibility of reproducing recognizable fragments of its training data).

(I do not agree that copyright, at least in its current form, should be applicable to computer programs. But it is weird that the same companies who try to exploit copyright against others also insist on the use of coding assistants that are a workaround against copyright laws. That workaround is the main reason they can increase programming productivity: they may cut and paste code that you are not allowed to copy yourself.)

whattheheckheck 23 hours ago [-]
If a photo cannot be copyrighted, then dark-factory code won't be either.
adrian_b 15 hours ago [-]
The output of a coding assistant cannot be copyrighted, but it may contain code from which the copyright has been removed and which is used in a manner incompatible with the original license.

Even the more permissive licenses, like BSD, MIT, etc., forbid the removal of the copyright notice when the code is reused.

While this may also happen with the source programs used for training, I was not aware of the behavior described in TFA for the Anthropic agents, which may search the Internet for source code applicable to the problem that must be solved. It seems even more likely that such code will not be used as its license allows.

endymi0n 1 days ago [-]
[dead]
miguel_martin 1 days ago [-]
It's unfortunate that they didn't eval subagents/orchestration for such a complex set of tasks (from what I can tell), e.g. analyze the program to produce an initial spec -> code -> review, then rinse and repeat, with each of those steps allocated to a separate subagent.

I would be interested to see if there’s a significant quantifiable difference.

NitpickLawyer 1 days ago [-]
This might actually be the whole value prop of this benchmark. Forget their initial scores, take open models (so we can be sure the base doesn't change), and test different combinations of harness + prompts + strategies + whatever memthing is popular today. See if the scores improve. Repeat.
andy12_ 1 days ago [-]
It's interesting that Figure 4 shows Sonnet and Opus having a curve clearly distinct from all other models, even from GPT 5.4. Anthropic superiority, I guess.
vatsachak 1 days ago [-]
In before "but they did not use my agent swarm"
red75prime 1 days ago [-]
In science N=1 is statistically insignificant. In business it might mean that you have a product.
makerofthings 1 days ago [-]
It’s the annoying thing about AI. If it works, the AI is magic. If it doesn’t work, you’re using it wrong.
riffraff 1 days ago [-]
It was the same thing with OOP, TDD, agile development, C, C++, Rust, ORMs..

Whenever something impacts a ton of people you will get some who gain a lot from it and some who don't, and they're generally unable to relate to the other side.

Maybe the thing works in some domain and not the other. Maybe the two groups are doing different things. Maybe the context around it is different. Maybe they have a different definition of "better".

I think it helps to keep an open mind and not grow attached to either position, but rather inquire, "well we did X with outcome Y, what did you do instead?"

NitpickLawyer 1 days ago [-]
So, would you change your view if someone else runs this bench w/ a different harness and gets better results?
brunoborges 20 hours ago [-]
I wonder if a model that does not know anything about a hypothetical programming language X could write code once given said language X's specification, APIs, and SDK tools and their documentation.

Meaning: the model has no idea, no access to examples, no previous codebase trained on, nothing, for language X. But it knows English, it knows how to program in general (training data does contain other programming languages), and everything we expect from LLMs today. It just doesn't know jack about language X.

sigmar 22 hours ago [-]
Neat research. I find figure 11 interesting. The models behave so differently there.

imo the benchmark should be named Can_It_Pull_a_CharDet_Bench

behaviors 1 days ago [-]
It's funny, because that task is very diverse. Any LLM will use the given codebase as a template (at least the free-tier models will).

My "software as a contract of behaviors" works like a program bench (I even cross-tested buildouts). I made an entire corpus layout for multi-agent, multi-platform builds to be compared, and even ran 50 contracts as an example. It honestly showed improvable areas and distinct differences between model code.

    {contract_name}/
    └── submissions/
        └── {date}_{os}_{agent}_{model}_{stack}/
            ├── {contract}.osc.md
            ├── osc.osc.md
            └── results/
                └── {contract}.snapshot.json

That's it: compare against the same contract, or find a new contract to compare with. Lots of signed/hash-pinned files are all you need to reproduce software from nothing with an LLM.

ProgramBench is close to that (they have a nice paper/article here), but I don't like the framing: having software to start with is not a benchmark of making code but of reverse engineering.

github/s1ugh34d/osc
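The hash-pinning part is straightforward to sketch (this is my own generic version, not the osc snapshot format): hash every file under a build directory so that two independent buildouts of the same contract can be diffed byte-for-byte.

```python
import hashlib
import os

def snapshot(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    result = {}
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                result[rel] = hashlib.sha256(f.read()).hexdigest()
    return result

def diff(a, b):
    """Paths missing from one side or whose contents differ."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))
```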

srijanshukla18 22 hours ago [-]
This is not a serious benchmark, come on.

Tomorrow I'm launching a benchmark where I check if an LLM can build an Airbus A320 from scratch without internet. (Spoiler: no LLM succeeds.)

luca-ctx 1 days ago [-]
RE: monolithic, single-file implementations

We have a lint that caps source code files at 650 LOC and it works really well.
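A cap like that fits in a one-screen lint. A minimal sketch (the 650 number is from the comment above; the function name and shape are my own assumptions, not the commenter's actual tooling):

```python
def check_loc(paths, cap=650):
    """Return (path, line_count) for every file whose line count exceeds cap."""
    offenders = []
    for path in paths:
        with open(path) as f:
            n = sum(1 for _ in f)  # count lines without loading the whole file
        if n > cap:
            offenders.append((path, n))
    return offenders
```

Wired into CI over something like the output of `git ls-files`, a non-empty result fails the build.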

themafia 16 hours ago [-]
Suggested alternative title for the paper:

Can American corporate desires finally kill community based open source once and for all?

I mean, it seems clear to me: companies hate the GPL, they're willing to play these games to try to get that code into their hands under the MIT license, and they're happy to use these thinly disguised methods to get it. I see all these absurd ideas as part and parcel of that larger strategy.

I find the current state of affairs disgusting.

arian_ 14 hours ago [-]
[flagged]
keyle 1 days ago [-]
How long until AI is not even writing code but producing machine code?

Think about it, all these compilers, tooling, what a waste!

I imagine a future where chipset makers will provide a model you can just prompt to "act upon that chipset" and voila, "You're absolutely right! Here is your binary."

We won't be developers, we won't be devops, we'll be rollmops! /s

_pdp_ 1 days ago [-]
Coding agents can write ASM. But if you mean emitting the actual machine code, that would require a very different approach at a very different level of abstraction than LLMs are designed for. Keep in mind that all LLMs are trained first on text and then fine-tuned on code.
in-silico 21 hours ago [-]
> Keep in mind that all LLMs are trained first on text and then fine-tuned on code.

No, they are trained on a mixture of text and code from the start.

keyle 1 days ago [-]
Good point! Long live ASM! Wasm everything!!1 /jk
vrganj 1 days ago [-]
Good luck reasoning about the output in any meaningful way then. AI introduces a bug? Well, you're fucked.
keyle 1 days ago [-]
Welcome to the future!
quinnjh 1 days ago [-]
My hunch is that it would take years of hundreds of thousands of developers working with machine code, posting Stack Overflow questions with machine code, and publishing GitHub repos written in it with documentation. That's all the free labor LLMs leveraged to use high-level langs.

>We won't be developers, we won't be devops, we'll be modelops! /s

I can still see this happening with higher-level langs. The thing is, the compiler is not replaced in the training data; more likely LLMs will give rise to semideterministic layers on top of compilers.

I could see nvidia achieving this first with how nice the devex is with CUDA

osti 1 days ago [-]
I heard they are already proficient at assembly languages.
aforwardslash 1 days ago [-]
They are - probably more proficient than with some high-level languages. I've used it for embedded stuff, including TI sitara PRU assembly, with great results. Frontier models can also easily "learn" directly from the manuals; asm is quite easy for them to pick up due to its "flat" (non-structured) nature.
qsera 1 days ago [-]
>Frontier models can also easily "learn" directly from the manuals;

Really? So you just include the manual in the context? Or how does that work?

aforwardslash 15 hours ago [-]
Yes, something like "analyze thoroughly the @datasheet.pdf and create a plan to implement x"
ForOldHack 17 hours ago [-]
They are proficient at IBM 370 ASM, but need a lot of help and a professional-level disassembler for x64/x86. Conversely, if you watch Hacker News, AI is poor/expensive for creating compilers, but it has extensive reference materials for MC68K, Itanium, SPARC, and ARM. This is where Gemini and Copilot become good co-pilots.
LeCompteSftware 1 days ago [-]
FWIW I think "LLMs are semideterministic" is something of a red herring. The real difference between LLM codegen and compilers is that compilers output logically the same assembly regardless of the variable names. If you're numerically solving a differential equation the compiler does not care if the floats represent heat through a pipe or dollars through a brokerage. Compilers don't care about semantic meaning, that concern is totally separated.

But even if its putatively implementing the same algorithm, LLMs certainly do not output basically the same finance Python as they would mechanical engineering Python. The style will be a little different. Sometimes the performance/clarity tradeoffs will be different. Sometimes it'll be fairly fancy and object-oriented, other times it'll be more low-level "objects are just dicts."

It's way more than a higher abstraction layer: LLM codegen involves a nontechnical tangling of concerns that doesn't exist with even the hoitiest-toitiest proof-checking compilers. It's a complete sea change. I find it incredibly disconcerting... for the same reason, by the way, that assembly programmers found Fortran and C disconcerting, and continued to reliably find employment for a good 40 years after higher-level languages were invented :) Actually even today. The assembly programmers who got hosed by C tended to be electricians who learned on the job - it's kind of cool to read old manuals from the 70s, carefully (and correctly!) explaining to electricians that a computer program is essentially an ephemeral circuit.

But I think there are specific skills around scientific thinking (learned at a formal college) and engineering carefulness (learned via hard knocks) that aren't going anywhere.
