Thanks for the clarification Zach!
Since I've been accused of lazy reading, I want to finish off the terminological discussion and put you in my shoes a bit. I get your motivation for preferring to call it "program synthesis" over "program induction," but it turns out that's an established term with about 60 years of history. Basically, to understand how I read it: replace every use of "program synthesis" with "techniques for searching constrained spaces of programs using things like SAT solvers and Monte Carlo." If you also replace "Daniel Selsam" with "researcher in SAT solvers who started off in a lab that uses Monte Carlo to generate assembly programs," then I actually think it becomes hard not to read it the way that I did -- the way that you said was the opposite of the intended reading. And there aren't really any clear cues that you are not talking about program synthesis in the existing sense -- no clear "I define a new term" paragraph. You might think that the lack of citation to established synthesis researchers would be a tell, but, unfortunately, experience has taught me that's fairly weak evidence of such.
So I read it again, this time replacing "program synthesis" with "Solomonoff induction." It does indeed read very differently.
And I read your last comment.
And my main reaction was: "Wait, you mean many people don't already see things this way?"
I mean, it's been a full decade since Jacob Andreas defended his thesis on modularity in neural networks. I just checked his Google Scholar. First paper (>1600 citations, published 2016), abstract, first sentence: "Visual question answering is fundamentally compositional in nature."
If neural nets are not doing a simplicity-biased search over compositional solutions, then what are they doing? Finding the largest program before the smaller ones? Not obtaining intermediate results? Not sharing substructure across tasks while still using a smaller number of weights? Developing novel algorithms that solve the problem in one-shot without any substeps?
I'd naively expect neural nets to, by default, be about as modular as biological systems, or programs evolved with GP. In many ways, very modular, but also with a lot of crazy coupling and fragility. Neural nets gain efficiency over GP by being able to start with a superposition of a vast number of solutions and grow or shrink many of them simultaneously. They also can be given stronger regularization than in biology. I would expect them to find a simple-ish solution but not the simplest. If they only found the most complex ones, that would be way more impressive.
Hmmm. Okay then, I'd like to understand your point.
But first, can we clear up some terminological confusion?
From this comment, it seems you are using "program synthesis" in ways which are precisely the opposite of its usual meaning. This means I need to do substantial reverse-engineering of what you mean in every line, in addition to trying to understand your points directly.
These are narrow hypothesis classes where you, the practitioner, chose a representation embedding your assumptions about problem structure.
This is a very confusing thing to write. I think you're saying that various techniques are differentiated from program synthesis because they "choose a representation embedding assumptions about the problem structure." But can you point to any papers in the program synthesis literature which don't do this?
But the bigger reason I chose "synthesis": I expect deep learning is doing something closer to constructing programs than enumerating over them. Gradient descent seems not to be (and couldn't possibly be) doing brute-force search through program space,
I think you're saying that you use the term "program synthesis" to mean "things that are not brute-force searches over program space."
But a brute-force search over program space is a perfectly valid synthesis technique! It's literally the first technique discussed in my advisor's Intro to Program Synthesis course. See https://people.csail.mit.edu/asolar/SynthesisCourse/Lecture2.htm (scroll down to "Explicit Enumeration").
Hi! I did my Ph. D. in a program synthesis lab that later become a mixed program synthesis / machine learning lab. "Machine learning is program synthesis under a different name" my advisor would say.
But my experienced turned out to not be very relevant to reading this post, because, I must say...I did not get what the point of this post is or the intended takeaways.
In genetic programming, there's a saying "Learning programs is generalized curve-fitting." And, indeed, the first chapter of "Foundations of Genetic Programming" is about trying to evolve a small AST that fits a curve. I gave a similar problem to my students as their intro to enumerative program synthesis.
As far as I can tell, that's the entire meat of this post. Programs can be scene as fancy curves. Okay, I see a few paragraphs about that, plus pages and pages of math tangents. What else am I supposed to get from reading?
To be a little more blunt, reading this reminded me of the line "The book 'Cryptonomicon' is 1000 pages of Neal Stephenson saying 'Hey, isn't this cool'" as he goes into random digressions into basic cryptography, information theory, etc. Yes, the mechanistic interpretability of grokking is cool. Yes, it's cool that it's connected to representation theory. No, I have no idea how that's related to any larger thesis in this post.
BTW, you'll probably find the keyword "program induction" more fruitful than "program synthesis." Program synthesis is a PL/FM term that usually refers to practical techniques for generating programs (or other objects that can be phrased as programs) from specs, examples, human feedback, existing code, etc. "Program induction" is an ML term that basically refers to what you're talking about: the philosophy of supervised learning being "learning programs."
I do not know hour they're handling it! Most of my demoscene knowledge comes from a single event by the CMU Computer Club in 2009.
nit: When demosceners hand-write assembly, it's not because they're foregoing a compiler for the sake of it. They operate under performance and size constraints that require having assembly that compilers cannot output. They also often build for older platforms, with only comparatively primitive compilers available.
Consensus view is that they were shielded by those who did invest in it.
I've written more about Y2K at https://www.lesswrong.com/posts/zvQdgfFEDFQQhDDuS/y2k-successful-practice-for-ai-alignment
Small historical nit: China was actually ruled by the Mongols for most of the 1300s.
Am I the only one who finds parts of the early story rather dystopian? He sounds like a puppet being pulled around by the AI, gradually losing his ability to have his own thoughts and conversations. (That part's not written, but it's the inevitable result of asking the AI every time he encounters struggle.)
I made the grave error of framing this post in a way that invites a definition debate. While we are well familiar that a definition debate is similar to a debate over natural categories, which is a perfectly fine discussion to have, the discussion here has suffered because several people came in with competing categories.
I strongly endorse Ben's post, and will edit the top post to incorporate it.
Thank you! This really made your thesis click for me.