I think they are less friendly if you want to do something more complicated with functions accepting more than 1 argument, but this seems solvable.
Indeed, this is called reverse Polish / postfix notation. For example, f(x, g(y, z)) becomes x (y z g) f, which is written without parentheses as x y z g f. That is, if you know the arity of all the letters beforehand, parentheses are unnecessary.
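A minimal sketch of how evaluation works once arities are fixed (the arities and function definitions below are made up for illustration, not taken from the comment above):

```python
# Minimal sketch: evaluating postfix/RPN when every symbol's arity is known.
# The arities and function definitions are made-up examples.

ARITY = {"g": 2, "f": 2}                       # assumed arities
FUNCS = {"g": lambda y, z: y * z,              # hypothetical g
         "f": lambda a, b: a + b}              # hypothetical f

def eval_postfix(tokens, env):
    stack = []
    for tok in tokens:
        if tok in ARITY:                       # function symbol: pop its arguments
            n = ARITY[tok]
            args = stack[-n:]
            del stack[-n:]
            stack.append(FUNCS[tok](*args))
        else:                                  # variable: look up its value
            stack.append(env[tok])
    return stack.pop()

# f(x, g(y, z)) written without parentheses as "x y z g f"
print(eval_postfix("x y z g f".split(), {"x": 1, "y": 2, "z": 3}))  # 1 + 2*3 = 7
```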
This seems pretty similar to the distinction between high-decouplers and low-decouplers (or "decouplers" and "contextualizers"). See here.
The Tickle Defense arguably makes EDT behave like CDT. The topic EDT=CDT was discussed on this forum before, though that was years ago; decision theory has somewhat fallen out of fashion since.
I mean yes: If you do good things, your image will probably improve. As I said, that part doesn't seem wrong to me.
Apart from that, I'm not convinced we actually know that he didn't have genuine altruistic interest in donating money to certain charities. Do we have evidence one way or the other?
Also, just practically speaking, if it had been about publicity, it would have been more effective to donate to cancer hospitals or starving orphans in Africa than to a weird and seemingly cult-like organization like SIAI.
I think even "launder their reputation" is too cynical. If a bad person does something good, via donation or otherwise, then that's arguably indeed good. Imagine trying to hinder bad people from doing good deeds with the justification that this would launder their bad reputation. That would assume that a bad person doing something good is actually bad, which is seems false. It is true that doing good things makes them seem less bad, but that's only because doing good things actually makes you less bad overall. (How much less bad is another question.)
There are a bunch of papers on this from analytic philosophy. Probably also from linguistics. The problem is the logical analysis of statements like "Men are taller than women", which is arguably true but not vacuously true.
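To illustrate the problem (my own rendering, not taken from a specific paper): the naive universal reading is too strong and the bare existential reading is too weak,

$$\forall x\,\forall y\,\big(M(x) \land W(y) \rightarrow T(x,y)\big) \quad \text{(false: some women are taller than some men)}$$

$$\exists x\,\exists y\,\big(M(x) \land W(y) \land T(x,y)\big) \quad \text{(true, but nearly vacuous)}$$

so the intended reading has to be something like a generic, e.g. a comparison of typical or average heights.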
Well, according to De Morgan's laws, $\neg(A_1 \land \dots \land A_n)$ is indeed equivalent to $\neg A_1 \lor \dots \lor \neg A_n$. So if $P(\neg(A_1 \land \dots \land A_n))$ is high (because $P(A_1 \land \dots \land A_n)$ is low), then $P(\neg A_1 \lor \dots \lor \neg A_n)$ is high. However, conjunctive arguments usually rely on an assumption of independence, while disjunctive arguments assume mutual exclusivity. I'm not sure whether these properties can be transformed into each other when switching between disjunctive and conjunctive arguments.
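For concreteness, these are the standard identities involved (my addition, not from the original argument): under the respective extreme assumptions,

$$P(A_1 \land \dots \land A_n) = \prod_{i=1}^{n} P(A_i) \quad \text{(independence)}$$

$$P(B_1 \lor \dots \lor B_n) = \sum_{i=1}^{n} P(B_i) \quad \text{(mutual exclusivity)}$$

while in general only the bounds $P(A_1 \land \dots \land A_n) \le \min_i P(A_i)$ and $P(B_1 \lor \dots \lor B_n) \ge \max_i P(B_i)$ hold.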
Another issue with continual learning is that it likely doesn't have the efficiency of today's cloud-based LLMs:
"inference-time gradient computation and per-user weight divergence breaking the efficiency of batched serving" (Are We in a Continual Learning Overhang?)
A similar thing was pointed out 14 years ago in this post. Anyway, I like the way you phrased it; very precise.
I'd say the general lesson is probably this: Arguments of the form "X is highly conjunctive in ways 1, 2, 3, ..., n, therefore it is very unlikely" and "X is highly disjunctive in ways 1, 2, 3, ..., n, therefore it is very likely" are not necessarily wrong, but easily misleading, because they trick us into underestimating or overestimating probabilities. The best solution is probably to avoid making these types of arguments and to only present a few (strong) reasons for/against something.
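A toy illustration with made-up numbers, using exactly those extreme assumptions (independence for the conjunction, mutual exclusivity for the disjunction):

```python
# Toy numbers: many individually plausible conjuncts make a conjunction look
# unlikely, and many individually implausible disjuncts make a disjunction look
# likely, provided one assumes independence / mutual exclusivity throughout.
from math import prod

conjuncts = [0.9] * 10   # each step seems plausible on its own
disjuncts = [0.1] * 10   # each route seems implausible on its own

p_conj = prod(conjuncts)   # assumes full independence
p_disj = sum(disjuncts)    # assumes mutual exclusivity

print(f"P(all 10 conjuncts hold) ~ {p_conj:.2f}")   # ~0.35
print(f"P(some disjunct holds)   ~ {p_disj:.2f}")   # ~1.00
```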
I think most things mentioned in 1.4 ("Algorithmic changes that are not really quantifiable as efficiency"), namely SFT, RLHF, and RLVR, belong to 1.1 (algorithmic efficiency progress) because they can actually be quantified as efficiency improvements. These have strongly increased capabilities, as measured by benchmarks, compared to GPT-3-style prompt engineering of the underlying base model. So a much smaller model with these improvements can match the performance of a larger base model without them.
Especially the invention and subsequent improvement of RLVR has made things possible (advanced math, programming, agentic tool use, answering questions that require a non-trivial amount of reasoning) that were far out of reach for older frontier models like GPT-3/GPT-4, which had no robust reasoning ability apart from the very brittle (hallucination-prone) "think step by step" trick.
I would also include improvements from synthetic training data as an algorithmic improvement, not a data-related improvement, because better synthetic training data is created by better algorithms. E.g. AlphaGo Zero would clearly count as a pure algorithmic improvement, because the synthetic data generated during self-play is itself the output of an improved algorithm. By the way, more recent forms of RLVR also include self-play, which hasn't been appreciated enough in my opinion. (Self-play can be classified as a weak form of RSI, distinct from classical RSI in the sense of an AI that does AI research.)
Even in forms of RLVR that rely on human-written training tasks without self-play, data-related improvement is not independent of algorithmic progress from RLVR, because the data would be useless without first inventing RLVR. So it is difficult (as Gundlach apparently tries to do) to say which of the two caused "more" of the progress they caused in combination.
Regarding distillation:
I think model distillation would not cause such a large and ongoing improvement in inference efficiency. When you first invent model distillation (a purely algorithmic type of progress), there is indeed a large one-time efficiency improvement, because a small model created via distillation is suddenly much better than another small model created without it. After that, however, there is no such improvement anymore, i.e. the relative difference between Gemini 2.5 Flash and Gemini 3 Flash (distilled models) presumably matches approximately the relative difference between Gemini 2.5 Pro and Gemini 3 Pro. So there is no further improvement from model distillation itself, as the improvement in the smaller models matches the improvement in the larger ones. Unless the distillation method itself improved, but that would again count as progress caused by an algorithmic efficiency improvement.
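For reference, a minimal sketch of classic soft-target distillation in the style of Hinton et al. (2015), with made-up logits; this is of course not claimed to be how Gemini's Flash models are actually produced:

```python
# Minimal sketch of soft-target knowledge distillation (Hinton et al. 2015 style).
# The logits below are made up; real distillation trains over a whole dataset.
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between the teacher's and student's soft targets at temperature T."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

teacher = np.array([4.0, 1.0, 0.5])   # logits of a large "teacher" model (made up)
student = np.array([2.5, 1.2, 0.8])   # logits of a small "student" model (made up)
print(distillation_loss(student, teacher))
```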