You’re missing the reason why most code isn’t hand-optimized: it doesn’t matter. If you’re writing a server, most of its time is spent waiting for a request, waiting for a database query, waiting on a file to read, or whatever. There is no point in optimizing those paths further. The only place where it matters is for heavily compute-bound processes. If you aren’t processing more than 10k records at a time, you’re unlikely to see much of a difference. Even then, you can likely optimize your high-level code to get most of the gains you’d get from jumping down to asm.
Premature optimization is a waste of time, time that is better spent working on a new problem. It’s better to use observability to find where your code is actually spending the most time in production, and then optimize those places, if they need it at all.
Your comment makes it sound a bit like there is no need for performance, but taking servers or REST services as an example, most programmers care about throughput, and almost all care about latency, both of which are measured with e.g. Prometheus. When your website takes one more second to load you lose clients, and if your code is slow it shows up on the cloud provider's bill. Even if you are IO bound, you can batch requests, go async, or do less IO.
The reason people don't bother hand-optimizing code is that the hardware is really fast, and that a handful of programmers put a lot of effort into writing optimizing compilers and optimized frameworks, so the average output is good enough for the average workload.
I’m not saying that there is no need to optimize for performance for REST-like servers, instead I am saying that it’s very dependent on the specific use case, which is difficult to predict. Often it can be more economical to scale up when there isn’t sufficient throughput, and to focus engineering optimization efforts on only those queries that are low-performing. Even then, there are typically optimizations to be made long before one reaches for assembly. Optimizing SQL queries, and efforts to increase parallelization are often sufficient.
For instance, the server that I work on is a GraphQL API written in TypeScript. It has a few million users, and it runs without problems. When I have had slow queries, I typically need to optimize the SQL/Prisma queries; twice I needed to optimize parallelization. We’re not particularly compute-bound, so I haven’t yet needed to offload processing to a compiled language. NodeJS is simply fast enough.
I disagree here. It is reasonably easy to mix assembler and C if there’s a clear reason for doing it.
Examples:
Software defined radio doing vector operations for the performance-critical digital filter. gnuradio has to do an excitingly difficult version of this because:
a. The software has to work on Intel, ARM, MIPS, and RISC-V.
b. Which vector operations the cpu supports is only known at run time.
So here, the performance-critical routines have to be written not just once for each target architecture, but in multiple versions for each target architecture depending on the level of vector support. (Does it support RISC-V vectorization of floating point? Does it support RISC-V vectorization of bit manipulation?)
And so at run time you need to substitute in the appropriate implementation depending on which CPU features the kernel reports. But it’s all doable.
But your general point is probably valid.
inline ASM doesn't seem that bad. I don't write much performance critical CPU code but it's very common in performance critical GPU code.
I suppose one important difference is that people usually don't read assembly/compiled binaries, but they do proofread AI-generated code (at least most claim to). I think it would be easier to couple manual code with LLM-generated code, marking it via some inline comment to force the assistant to ignore it or ask for permission before changing anything there, compared to inserting assembly into compiled code (plus non-assembly code should be mostly hardware-independent). This suggests human-level enhancements are going to stay feasible, and coding assistants have a larger gap to close than compilers did before removing 99.99% of lower-level coding.
The LLVM compiler has some extremely exciting code that identifies if it is compiling an implementation of popcount(), and if so substitutes in an llvm-ir primitive for popcount, which will get compiled down to a popcount instruction if the target has one.
As I said, this code is very entertaining.
Really, I ought to extend it so it also recognizes a different common way of implementing popcount, for reasons of getting better scores in some commonly used benchmarks. (Changing the benchmark? Clearly cheating. Extending the compiler so it recognises a code sequence in a common benchmark? Slightly sketchy.) But really, I can’t be bothered to write a PR against that horrific piece of compiler code.
I hope that when episodic memory is cracked (ie cheap), it helps with this all-or-nothing problem. That would be really great.
Upvote but I'm not sure I understand or agree with the thesis. Programming in C is already pretty niche, and the amount of code that is worth the tradeoff to hyper-optimize (cost to do, to maintain, to re-optimize with new host architectures or microcode optimizations, etc.) is absolutely tiny, and getting smaller all the time.
For most of the tasks where this would be beneficial, the focus has shifted over the last few decades from performance to correctness. The move isn't from C to ASM, but from C to Rust, or to validated C (TLA+ for design, a bounded model checker for code).
There still is a place for human optimization based on use cases the compiler-optimizer can't see, but it's small and shrinking.
I haven't seen any papers on this, but I'd expect modern coding agents to write ASM that's more correct AND more performant for optimized subroutines than the vast majority of humans. Really, any optimization for something small enough that you can write benchmarks for and measure improvements, automation is going to win.
My thesis is approximately "we don't write assembly because it usually doesn't provide much practical benefit and also it's obnoxious to do". This is in opposition to the thesis "we don't write assembly because computers have surpassed the abilities of all but the best humans and so human intervention would only make the output worse".
I think this is an important point because some people seem to be under the impression that "LLMs can write better code than pretty much all humans" is a necessary prerequisite for "it's usually not worth it for a human to write code", and also operating under the model of "once LLMs write most code, there will be nothing left to do for the people with software development skills".
"once LLMs write most code, there will be nothing left to do for the people with software development skills".
is a mismatch of quantifiers. If LLMs write most code, there's no need for most of the people with those software development skills which are necessary and which LLMs can do well enough. That doesn't say ANYTHING about the software development skills which LLMs cannot do well enough.
I can't tell if you're just saying "LLMs can't do this part well, yet", or if you're asserting that humans have some ability in assembler that LLMs won't match in the foreseeable future.
The point with assembler is drawing the analogy "assembly programmers : optimizing compilers :: programmers-in-general : scaffolded LLMs". The post was not about any particular opinions I have[1] about how LLMs will or won't interact with assembly code.
As optimizing compilers became popular, assembly programmers found that their particular skill of writing assembly code from scratch was largely obsolete. They didn't generally become unemployed as a result, though. Instead, many of the incidental skills they picked up along the way[2] went from "incidental side skill" to "main value proposition".
I do have such opinions, namely "LLMs mostly won't write asm, for basically the same reasons humans don't write much asm". But that opinion isn't super relevant here.
e.g. knowing how to read a crash dump, or which memory access patterns are good, or just general skill at translating high-level descriptions of program behavior into a good data model and code that correctly operates on those data structures
I don't know what the analogue is here. Better scaffolding? More capable LLM?
Fine-tuning on examples of doing the thing you want, or making sure your new trick gets into datasets? Even higher effort than the compiler version, for the amount of improvement it buys?
One implication of this worldview is that as long as there are still some identifiable high-leverage places where humans still write better code than LLMs
I agree (at least on the short term as you point out), but it seems hard to predict what these places will be (and thus hard to prepare for it), and it still seems likely that the market will be tough for the 90% of the programmers that are not experts in the specific niche things AIs are not good at.
Nitpick on the hashmap example: while I agree that the compiler does not produce optimal code in that case (which may be your main point), there is no need to write assembly to get the speed-up you describe, you can iterate the backing array in C. The compiler may or may not generate SIMD code however, so you may want to use SIMD intrinsics which is very close to assembly.
That's entirely fair. And tbh most of the time I'm looking at a hot loop where the compiler did something dumb, the first question I ask myself is "is there some other way I could write this code so that the compiler will recognize that the more performant code is an option". Compilers are really quite good and fully featured these days, so there usually is some code transformation or pragma or compiler flag that will work for my specific use case.
If we can get SC LLMs, this problem would fade away and the initial quote would become 100% true. Also, a SC LLM could directly write optimized code in assembler (would that define a hypercoder LLM? And the end of programming languages?).
There's a take I've seen going around, which goes approximately like this:
I 85% agree with this take.
However, I think there's one important inaccuracy: even today, finding places where your optimizing compiler failed to produce optimal code is often pretty straightforward, and once you've identified those places, 10x+ speedups for that specific program on that specific hardware are often possible[1]. The reason nobody writes assembly anymore is the difficulty of mixing hand-written assembly with machine-generated assembly.
The issue is that it's easy to have the compiler write all of the assembly in your project, and it's easy from a build perspective to have the compiler write none of the assembly in your project, but having the compiler write most but not all of the assembly in your project is hard. As with many things in programming, having two sources of truth leads to sadness. You have many choices for what to do if you spot an optimization the compiler missed, and all of them are bad:
I think most of these strategies have fairly direct analogues with a codebase that an LLM agent generates from a natural language spec, and that the pitfalls are also analogous. Specifically:
One implication of this worldview is that as long as there are still some identifiable high-leverage places where humans still write better code than LLMs[3], if you are capable of identifying good boundaries for libraries / services / APIs which package a coherent bundle of functionality, then you will probably still find significant demand for your services as a developer.
Of course if AI capabilities stop being so "spiky" relative to human capabilities this analogy will break down, and also there's a significant chance that we all die[4]. Aside from that, though, this feels like an interesting and fruitful near-term forecasting/extrapolation exercise.
For a slightly contrived concrete example that rhymes with stuff that occurs in the wild, let's say you do something along the lines of "half-fill a hash table with entries, then iterate through the same keys in the same order summing the values in the hash table"
Like so
Your optimizing compiler will spit out assembly which iterates through the keys, fetches the value of each one, and adds it to the total. The memory access patterns will not be pretty.
Example asm generated by gcc -O3
This is the best your compiler can do: since the ordering of floating point operations can matter, it has to iterate through the keys in the order you gave. However, you the programmer might have some knowledge your compiler lacks, like "actually the backing array is zero-initialized, half-full, and we're going to be reading every value in it and summing". So you can replace the compiler-generated code with something like "go through the entire backing array in memory order and add all values".
Example lovingly hand-written asm by someone who is not very good at writing asm
I observe a ~14x speedup with the hand-rolled assembly here.
In real life, I would basically never hand-roll assembly here, though I might replace the C code with the optimized version and a giant block comment explaining the terrible hack I was doing, why I was doing it, and why the compiler didn't do the code transform for me. I would, of course, only do this if this was in a hot region of code.
Whenever someone says something is "true in some sense", that means that thing is false in most senses.
Likely somewhere between 25 weeks and 25 years
AI capabilities remaining "spiky" won't necessarily help with this