You’re missing the reason why most code isn’t hand-optimized: it doesn’t matter. If you’re writing a server, most of its time is spent waiting for a request, waiting for a database query, waiting on a file to read, or whatever. There is no point in optimizing those paths further. The only place where it matters is for heavily compute-bound processes. If you aren’t processing more than 10k records at a time, you’re unlikely to see much of a difference. Even then, you can likely optimize your high-level code to get most of the gains you’d get from jumping down to asm.
Premature optimization is a waste of time, time that is better spent working on a new problem. It’s better to use observability to find where your code is actually spending the most time in production, and then optimize those places, if they need it at all.
Your comment makes it sound a bit like there is no need for performance, but taking servers or REST services as an example, most programmers care about throughput, and almost all care about latency, both of which are measured with e.g. Prometheus. When your website takes one more second to load you lose clients, and if your code is slow it shows up on the cloud provider's bill. Even if you are IO bound, you can batch requests, go async, or do less IO.
The reason people don't bother hand-optimizing code is that the hardware is really fast, and that a handful of programmers put a lot of effort into writing optimizing compilers and optimized frameworks, so the average output is good enough for the average workload.
I’m not saying that there is no need to optimize for performance for REST-like servers, instead I am saying that it’s very dependent on the specific use case, which is difficult to predict. Often it can be more economical to scale up when there isn’t sufficient throughput, and to focus engineering optimization efforts on only those queries that are low-performing. Even then, there are typically optimizations to be made long before one reaches for assembly. Optimizing SQL queries, and efforts to increase parallelization are often sufficient.
For instance, the server that I work on is a GraphQL API written in TypeScript. It has a few million users, and it runs without problems. When I have had slow queries, I typically need to optimize the SQL/Prisma queries; twice I needed to optimize parallelization. We’re not particularly compute-bound, so I haven’t yet needed to offload processing to a compiled language. NodeJS is simply fast enough.
I disagree here. It is reasonably easy to mix assembler and C if there’s a clear reason for doing it.
Examples:
Software defined radio doing vector operations for the performance-critical digital filter. gnuradio has to do an excitingly difficult version of this because:
a. The software has to work on Intel, ARM, MIPS, and RISC-V.
b. Which vector operations the cpu supports is only known at run time.
So here, the performance-critical routines have to be written not just once for each target architecture, but in multiple versions for each target architecture depending on the level of vector support. (Does it support RISC-V vectorization of floating point? Does it support RISC-V vectorization of bit manipulation?)
And so at run time you need to substitute in the appropriate implementation depending on which CPU features the kernel reports. But it’s all doable.
But your general point is probably valid.
inline ASM doesn't seem that bad. I don't write much performance critical CPU code but it's very common in performance critical GPU code.
I suppose one important difference is that people usually don't read assembly/compiled binaries, but they do proofread AI-generated code (at least most claim to). I think it would be easier to couple manual code with LLM-generated code, marking it via some inline comment to force the assistant to ignore it or ask for permission before changing anything there, compared to inserting assembly into compiled code (plus non-assembly code should be mostly hardware-independent). This suggests human-level enhancements are going to stay feasible, and coding assistants have a larger gap to close than compilers did before removing 99.99% of lower-level coding.
The LLVM compiler has some extremely exciting code that identifies if it is compiling an implementation of popcount(), and if so substitutes in an llvm-ir primitive for popcount, which will get compiled down to a popcount instruction if the target has one.
As I said, this code is very entertaining.
Really, I ought to extend it so it also recognizes a different common way of implementing popcount, for reasons of getting better scores in some commonly used benchmarks. (Changing the benchmark? Clearly cheating. Extending the compiler so it recognises a code sequence in a common benchmark? Slightly sketchy.) But really, I can’t be bothered to write a PR against that horrific piece of compiler code.
I hope that when episodic memory is cracked (ie cheap), it helps with this all-or-nothing problem. That would be really great.
Upvote but I'm not sure I understand or agree with the thesis. Programming in C is already pretty niche, and the amount of code that is worth the tradeoff to hyper-optimize (cost to do, to maintain, to re-optimize with new host architectures or microcode optimizations, etc.) is absolutely tiny, and getting smaller all the time.
For most of the tasks where this would be beneficial, the focus has shifted over the last few decades from performance to correctness. The move isn't from C to ASM, but from C to Rust, or to validated C (TLA+ for design, a bounded model checker for code).
There still is a place for human optimization based on use cases the compiler-optimizer can't see, but it's small and shrinking.
I haven't seen any papers on this, but I'd expect modern coding agents to write ASM that's more correct AND more performant for optimized subroutines than the vast majority of humans. Really, any optimization for something small enough that you can write benchmarks for and measure improvements, automation is going to win.
My thesis is approximately "we don't write assembly because it usually doesn't provide much practical benefit and also it's obnoxious to do". This is in opposition to the thesis "we don't write assembly because computers have surpassed the abilities of all but the best humans and so human intervention would only make the output worse".
I think this is an important point because some people seem to be under the impression that "LLMs can write better code than pretty much all humans" is a necessary prerequisite for "it's usually not worth it for a human to write code", and also operating under the model of "once LLMs write most code, there will be nothing left to do for the people with software development skills".
"once LLMs write most code, there will be nothing left to do for the people with software development skills".
is a mismatch of quantifiers. If LLMs write most code, there's no need for most of the people with those software development skills which are necessary and which LLMs can do well enough. That doesn't say ANYTHING about the software development skills which LLMs cannot do well enough.
I can't tell if you're just saying "LLMs can't do this part well, yet", or if you're asserting that humans have some ability in assembler that LLMs won't match in the foreseeable future.
The point with assembler is drawing the analogy "assembly programmers : optimizing compilers :: programmers-in-general : scaffolded LLMs". The post was not about any particular opinions I have[1] about how LLMs will or won't interact with assembly code.
As optimizing compilers became popular, assembly programmers found that their particular skill of writing assembly code from scratch was largely obsolete. They didn't generally become unemployed as a result, though. Instead, many of the incidental skills they picked up along the way[2] went from "incidental side skill" to "main value proposition".
I do have such opinions, namely "LLMs mostly won't write asm, for basically the same reasons humans don't write much asm". But that opinion isn't super relevant here.
e.g. knowing how to read a crash dump, or which memory access patterns are good, or just general skill at translating high-level descriptions of program behavior into a good data model and code that correctly operates on those data structures
I don't know what the analogue is here. Better scaffolding? More capable LLM?
Fine-tuning on examples of doing the thing you want, or making sure your new trick gets into datasets? Even higher effort than the compiler version, for the amount of improvement it buys?
One implication of this worldview is that as long as there are still some identifiable high-leverage places where humans still write better code than LLMs
I agree (at least on the short term as you point out), but it seems hard to predict what these places will be (and thus hard to prepare for it), and it still seems likely that the market will be tough for the 90% of the programmers that are not experts in the specific niche things AIs are not good at.
Nitpick on the hashmap example: while I agree that the compiler does not produce optimal code in that case (which may be your main point), there is no need to write assembly to get the speed-up you describe, you can iterate the backing array in C. The compiler may or may not generate SIMD code however, so you may want to use SIMD intrinsics which is very close to assembly.
That's entirely fair. And tbh most of the time I'm looking at a hot loop where the compiler did something dumb, the first question I ask myself is "is there some other way I could write this code so that the compiler will recognize that the more performant code is an option". Compilers are really quite good and fully featured these days, so there usually is some code transformation or pragma or compiler flag that will work for my specific use case.
If we can get SC LLMs, this problem would fade away and the initial quote would become 100% true. Also, a SC LLM could directly write optimized code in assembler (would that define a hypercoder LLM? And the end of programming languages?).
There's a take I've seen going around, which goes approximately like this:
I 85% agree with this take.
However, I think there's one important inaccuracy: even today, finding places where your optimizing compiler failed to produce optimal code is often pretty straightforward, and once you've identified those places, 10x+ speedups for that specific program on that specific hardware are often possible[1]. The reason nobody writes assembly anymore is the difficulty of mixing hand-written assembly with machine-generated assembly.
The issue is that it's easy to have the compiler write all of the assembly in your project, and it's easy from a build perspective to have the compiler write none of the assembly in your project, but having the compiler write most but not all of the assembly in your project is hard. As with many things in programming, having two sources of truth leads to sadness. You have many choices for what to do if you spot an optimization the compiler missed, and all of them are bad:
I think most of these strategies have fairly direct analogues with a codebase that an LLM agent generates from a natural language spec, and that the pitfalls are also analogous. Specifically:
One implication of this worldview is that as long as there are still some identifiable high-leverage places where humans still write better code than LLMs[3], if you are capable of identifying good boundaries for libraries / services / APIs which package a coherent bundle of functionality, then you will probably still find significant demand for your services as a developer.
Of course if AI capabilities stop being so "spiky" relative to human capabilities this analogy will break down, and also there's a significant chance that we all die[4]. Aside from that, though, this feels like an interesting and fruitful near-term forecasting/extrapolation exercise.
For a slightly contrived concrete example that rhymes with stuff that occurs in the wild, let's say you do something along the lines of "half-fill a hash table with entries, then iterate through the same keys in the same order summing the values in the hash table"
Like so
Your optimizing compiler will spit out assembly which iterates through the keys, fetches the value of each one, and adds it to the total. The memory access patterns will not be pretty.
Example asm generated by gcc -O3
This is the best your compiler can do: since the ordering of floating point operations can matter, it has to iterate through the keys in the order you gave. However, you the programmer might have some knowledge your compiler lacks, like "actually the backing array is zero-initialized, half-full, and we're going to be reading every value in it and summing". So you can replace the compiler-generated code with something like "go through the entire backing array in memory order and add all values".
Example lovingly hand-written asm by someone who is not very good at writing asm
I observe a ~14x speedup with the hand-rolled assembly here.
In real life, I would basically never hand-roll assembly here, though I might replace the C code with the optimized version and a giant block comment explaining the terrible hack I was doing, why I was doing it, and why the compiler didn't do the code transform for me. I would, of course, only do this if this was in a hot region of code.
Whenever someone says something is "true in some sense", that means that thing is false in most senses.
Likely somewhere between 25 weeks and 25 years
AI capabilities remaining "spiky" won't necessarily help with this