There are serious problems with the idea of the 'Bitter Lesson' in AI. In most cases, techniques other than scale prove extremely useful for a time, and then are promptly abandoned as soon as scaling catches up to them, when we could just as easily combine the two and get better performance than either alone. Hybrid algorithms are good for all sorts of things in the real world.
For instance, in computer science, quicksort is easily the most common sorting algorithm, but who uses a pure quicksort? Instead, implementations bolt on an algorithm that changes the base case, handles lists with small numbers of entries, and so on. People could have 'learned the lesson' that quicksort is simply better than those small-list algorithms once you reach any significant size, but that would have prevented them from improving quicksort.
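To make that concrete, here is a minimal sketch of the kind of hybrid that real sorts (such as introsort, or quicksorts with an insertion-sort cutoff) build on: quicksort for large partitions, insertion sort once a partition gets small. This is not any particular library's implementation, and the cutoff of 16 is just an illustrative value.

```python
import random

def hybrid_quicksort(a, lo=0, hi=None, cutoff=16):
    """Quicksort that hands small sub-arrays to insertion sort.

    The cutoff of 16 is illustrative; real libraries tune it and add
    further tricks (median-of-three pivots, heapsort fallback, etc.).
    """
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        if hi - lo + 1 <= cutoff:
            # Small sub-array: insertion sort wins on constant factors.
            for i in range(lo + 1, hi + 1):
                key, j = a[i], i - 1
                while j >= lo and a[j] > key:
                    a[j + 1] = a[j]
                    j -= 1
                a[j + 1] = key
            return
        # Otherwise partition around a random pivot.
        p = _partition(a, lo, hi)
        # Recurse into the smaller half first to bound stack depth.
        if p - lo < hi - p:
            hybrid_quicksort(a, lo, p - 1, cutoff)
            lo = p + 1
        else:
            hybrid_quicksort(a, p + 1, hi, cutoff)
            hi = p - 1

def _partition(a, lo, hi):
    # Lomuto partition with a random pivot.
    pivot_index = random.randint(lo, hi)
    a[pivot_index], a[hi] = a[hi], a[pivot_index]
    pivot = a[hi]
    i = lo - 1
    for j in range(lo, hi):
        if a[j] <= pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[hi] = a[hi], a[i + 1]
    return i + 1
```

The point is that nobody treats the small-list trick as an embarrassing crutch to be discarded once quicksort "scales"; they just keep both.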
Another, unrelated, personal example. When I started listening to Korean music, it bothered me that I couldn't understand what they were singing, so for some of my favorite songs I searched pretty extensively for translations.
When I couldn't find translations for some songs, I decided to translate them myself. I didn't know more than a phrase or two of Korean at the time, so I gave them to an AI (in this case Google Translate, which had already transitioned to deep learning methods by then).
Google Translate's translations were of unacceptably low quality, of course, so I used it only as a word or short-phrase reference. Crucially, I didn't know Korean grammar at the time either (and Korean orders words differently than English does). It took me a long time to translate those first songs, but nowhere near long enough to count as much training on the task.
So how did my translations of a language I didn't know compare to the deep learning used by Google? They were as different as night and day, in favor of my translations. That's not because I'm special, but because I used the methods people discarded in favor of 'just scale'. What's a noun? What's a verb? How do concepts fit together? Crucially, what kind of thing would a person be trying to say in a song, and how would they say it? Plus checking it backwards: if they were trying to say this thing, would they have said it that way? I do admit to a bit of prompt engineering, even though I didn't know what that was at the time; but that just means I instinctively knew that a search through similar inputs would give a better result.
I have since improved massively: by learning how grammar works in Korean (which I started studying 8 months later), by learning more about the concepts used in English grammar, by learning what the words mean, and so on. Simple practice too, of course. It leveled off after a while, then started improving again when I consulted an actual dictionary; why don't we give these LLMs real dictionaries? But I could already improve on the machine output without any of that, just by using simple concepts and approaches we refuse to supply these AIs with.
Google Translate has since improved quite a lot, but it is still far behind even those first translations, where all I added was something even simpler than grammar, the basic structure of language and thought, along with actual grounding in English. Notably, that's despite the fact that Google Translate is far ahead of more general models like GPT-J. Sure, my 'scale' dwarfs theirs, obviously, but I was trained on almost no Korean data at all; because I had these concepts, and a reference, I could do much better.
Even the example of something like AlphaGo wasn't so much a triumph of deep learning over everything else as a demonstration that if you combine insights like search algorithms (Monte Carlo tree search) with heuristics trained from countless games, things go much better.
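Here is a minimal sketch of that general pattern: a plain UCB tree search whose leaf evaluation comes from a learned value function instead of random rollouts. The game interface (`legal_moves`, `play`, `is_terminal`, `terminal_value`) and `value_fn` are hypothetical stand-ins, and this is not AlphaGo's actual algorithm (which uses PUCT with policy priors); it just shows how search and a trained heuristic slot together.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}        # move -> Node
        self.visits = 0
        self.total_value = 0.0

def ucb(parent, child, c=1.4):
    # Unvisited children are explored first.
    if child.visits == 0:
        return float("inf")
    exploit = child.total_value / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def search(root_state, legal_moves, play, is_terminal, terminal_value,
           value_fn, n_simulations=400):
    """Generic tree search; value_fn stands in for a trained evaluator."""
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: descend through expanded nodes by UCB score.
        while node.children and not is_terminal(node.state):
            parent = node
            _, node = max(parent.children.items(),
                          key=lambda kv: ucb(parent, kv[1]))
        # 2. Expansion: add children for every legal move, pick one.
        if not is_terminal(node.state):
            for move in legal_moves(node.state):
                node.children[move] = Node(play(node.state, move), parent=node)
            node = random.choice(list(node.children.values()))
        # 3. Evaluation: the trained heuristic replaces a random rollout.
        value = (terminal_value(node.state) if is_terminal(node.state)
                 else value_fn(node.state))
        # 4. Backpropagation: flip sign each ply (two-player, zero-sum).
        while node is not None:
            node.visits += 1
            node.total_value += value
            value = -value
            node = node.parent
    # Play the most-visited move at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

Neither half does the job alone: the search without the learned evaluator is blind past its horizon, and the evaluator without the search has no lookahead.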
The real bitter lesson is that we give up on improvements out of some misguided purity.
No. That's a foolish interpretation of domain insight. We have a massive number of highly general strategies that nonetheless work better for some things than others. A domain insight is simply some kind of understanding involving the domain being put to use. Something as simple as whether to use a linked list or an array can rest on a minor domain insight. Whether to use a Monte Carlo search or a depth-limited search, and so on, are definitely insights. Most advances in AI to this point have in fact been based on domain insights, and only a small amount on scaling within an approach (though more so recently). Even the 'Bitter Lesson' is an attempted insight into the domain (one that is wrong because it is a severe overreaction to previous failures).
Also, most domain insights are in fact an understanding of constraints. 'This path will never have a reward' is both an insight and a constraint. 'Dying doesn't allow me to get the reward later' is both a constraint and a domain insight. So is 'the lists I sort will never have numbers outside the range 143 to 987' (which is useful for an O(n) kind of sorting). With this whole enterprise in AI, we are in fact trying to automate the process of getting domain insights by machine, especially in whatever domain we have trained them for.
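For the sorting example, here is a rough sketch of the O(n + k) sort (a simple counting sort) that only becomes available once you know that constraint; the 143/987 bounds are just the range from the example above.

```python
def counting_sort_constrained(xs, lo=143, hi=987):
    """O(n + k) sort that relies on a domain constraint:
    every value is known to lie in [lo, hi]."""
    counts = [0] * (hi - lo + 1)
    for x in xs:
        if not lo <= x <= hi:
            raise ValueError(f"constraint violated: {x} not in [{lo}, {hi}]")
        counts[x - lo] += 1
    out = []
    for offset, c in enumerate(counts):
        out.extend([lo + offset] * c)
    return out
```

Without the constraint you are stuck with comparison sorting and its O(n log n) floor; with it, the "insight" buys you an asymptotically faster algorithm.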
Even 'should we scale via parameters or via data?' is a domain insight. They recently found out they had gotten that one wrong (Chinchilla), precisely because they had focused too much on just scaling.
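For reference, the rule of thumb usually quoted from the Chinchilla result is on the order of 20 training tokens per parameter, combined with the standard approximation that training a dense transformer costs about 6·N·D FLOPs. Both constants are approximations, so treat the numbers below as illustrative rather than exact.

```python
def chinchilla_rough_allocation(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal split in the spirit of the Chinchilla result.

    Uses C ~= 6 * N * D (training FLOPs for a dense transformer) and
    D ~= 20 * N (tokens per parameter); both are rules of thumb.
    """
    # C = 6 * N * D and D = k * N  =>  N = sqrt(C / (6 * k))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: ~5.76e23 FLOPs, roughly the Chinchilla-scale training budget.
params, tokens = chinchilla_rough_allocation(5.76e23)
print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")
# -> roughly 70B parameters and 1.4T tokens, in line with Chinchilla.
```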
AlphaZero was given some minor domain insights (how to search, and the rules of the game) years later, and ended up slightly beating a much earlier approach, because that is specifically what they were trying to do. I specifically said that sort of thing happens. It's just not as good as it could have been (probably).