I'd definitely recommend running some empirical experiments. I have been in a similar boat where I sort of just want to theorize and feel an ugh field around doing the hard work of actually implementing things.
Fortunately now we have Claude Code, so in theory, running these experiments can be a lot easier. However, you still have to pay attention and verify its outputs, enough to be very sure that there isn't some bug or misspecification that'll trip you up. This goes for other AI outputs as well.
The thing is, people are unlikely to read your work unless you show something clearly surprising or impressive. Like, if you claim to have a solution for an outstanding mathematical problem or a method that is SotA on some benchmark, people are much more likely to pay attention. Even then, people will apply a lot of scrutiny to make sure your claims are true, and you really have to do your homework ruling out every other possibility.
The demand for a high degree of rigor, especially from unproven researchers, is a reality of research. If people didn't apply strict heuristics for what papers to read, whether via the researcher's credentials or the strength of their claims, they'd be spending tons of time trying to understand papers that turn out to be pretty insubstantial.
I don't know enough to properly evaluate the math in your paper - like, I don't know what the Kantorovich-Rubinstein duality is. (Figuring out whether it makes sense to use it here would probably take a nontrivial amount of effort even for a more mathematically-inclined reader. I think getting ML people to read through a highly math-heavy paper may be especially difficult.)
The lack of citations is concerning to me, since it implies you don't really know what other people have had to say on this topic, and what your paper contributes beyond that baseline. Using AI is sometimes fine, but you really have to do the cognitive legwork yourself - citing ChatGPT as a coauthor implies you're using it for a lot more than copyediting or lit review. And reading the text of the Croissant Principle itself, it seems pretty obvious? "Good learning requires generalization, don't overfit to individual data points." It does not make me optimistic about the rest of the paper.
I'm hoping this can be a constructive comment rather than just critical - I guess my first advice would be to start by reading a lot of papers that excite you, seeing how they structure their arguments, and getting a sense of what the important questions in your chosen subfield are. Maybe you have been reading a lot of papers already, but I recommend reading more. Then do those empirical experiments - make a falsifiable prediction which you're not sure whether it's true (even after doing a thorough literature review) and go find out if it's true!
Ideally you could get mentorship too, which is sort of a chicken-and-egg problem since you generally have to have some legible credentials in order for a mentor to want to spend their time helping you. I think SPAR is pretty good for early-career researchers though.
Ultimately I think much of this boils down to "put in a lot of (sometimes unpleasant) work to get better at research." This recent thread also summarizes some good research advice. Best of luck!
Yeah I was originally envisioning this as an ML theory paper which is why it's math-heavy and doesn't have experiments. Tbh, as far as I understand, my paper is far more useful than most ML theory papers because it actually engages with empirical phenomena people care about and provides reasonable testable explanations.
Ha, I think some rando saying "hey I have plausible explanations for two mysterious regularities in ML via this theoretical framework but I could be wrong" is way more attention-worthy than another "I proved RH in 1 page!" or "I built ASI in my garage!"
Mmm, I know how to do "good" research. I just don't think it's a "good" use of my time. I honestly don't think adding citations and a lit review will help anybody nearly as much as working on other ideas.
PS: Just because someone doesn't flash their credentials, doesn't mean they don't have stellar credentials ;)
Rereading your LessWrong summary, it does feel like it's written in your own voice, which makes me a bit more confident that you do in fact know math. Tbh I didn't get a good impression from skimming the paper, but it's possible you actually discovered something real and did in fact use ChatGPT mainly for editing. Apologies if I'm just making unfounded criticisms from the peanut gallery.
Oh yes I do know math lol. Yeah the summary above hits most of the main ideas if you're not too familiar with pure math.
Thanks, interesting! I hadn't read this paper before.
Some initial thoughts:
I recently wrote an ML theory paper which proposes explanations for mysterious phenomena in contemporary machine learning like data scaling laws and double descent. Here's the link to the paper and the Twitter thread. It didn't get much attention, and I need an endorser to publish on arXiv, so I thought I'd post it here and get some feedback (and maybe an endorser!)
Essentially, what the paper does is propose that all data in a statistical learning problem arises from a latent space via a generative map. From this we derive an upper bound on the true loss in terms of the training/empirical loss, the distance in latent space to the closest training sample at which the model attains a loss better than the training loss, and the compressibility of the model (similar to Kolmogorov complexity).
Modulo a (reasonable but as yet unproven) conjecture, we are able to explain why data scaling follows a power law, as well as the exact form of the exponent. The intuition comes from Hausdorff dimension, which measures the dimension of a metric space.
Imagine you are building a model with 1-dimensional inputs, let's say in the unit interval $[0,1]$. Let's say you have ten training samples distributed evenly. If the loss of your model is Lipschitz (doesn't change unboundedly fast; e.g. for smooth enough functions the derivative is bounded), then its loss on any test sample can't exceed the loss at the closest training point plus the distance to that point (capped at around $1/10$ here) times the Lipschitz constant (the bound on the derivative).
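Here's a tiny toy check of that 1-D picture (not from the paper itself, just an illustration with an arbitrary made-up loss landscape): if the pointwise loss has derivative bounded by $L$, then the loss at any test point is at most the loss at the nearest training point plus $L$ times the distance to it.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    # An arbitrary smooth "loss landscape" on [0, 1]; its derivative is
    # 2*pi*sin(4*pi*x), so it is Lipschitz with constant L = 2*pi.
    return np.sin(2 * np.pi * x) ** 2

L = 2 * np.pi                       # a valid Lipschitz constant for this loss
train_x = np.linspace(0, 1, 10)     # ten evenly spaced training samples
test_x = rng.uniform(0, 1, 1000)    # random test points in [0, 1]

# Distance from each test point to its nearest training sample.
gaps = np.abs(test_x[:, None] - train_x[None, :])
dist = gaps.min(axis=1)
nearest = train_x[gaps.argmin(axis=1)]

# Lipschitz bound: loss(test) <= loss(nearest train point) + L * distance.
bound = loss(nearest) + L * dist
assert np.all(loss(test_x) <= bound + 1e-12)
print("worst-case test loss:", loss(test_x).max())
print("worst-case bound:    ", bound.max())
```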
If you want to improve generalization, you can sample more data. If these are spaced optimally (evenly), the maximum distance to a training sample decreases like $1/n$, as can be easily seen. However, if you were working with 2-dimensional data, it would scale like $1/\sqrt{n}$! Hausdorff dimension essentially defines the dimension of a metric space as the number $d$ such that this scales like $n^{-1/d}$.
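You can see the $n^{-1/d}$ scaling numerically with another quick sketch (again just an illustration, not one of the paper's results): lay down an evenly spaced grid of training points in $[0,1]^d$, measure the worst-case distance from random test points to the nearest training point, and fit the exponent on a log-log scale. The fitted exponent should come out near $-1/d$.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

def max_covering_distance(m, d, n_test=20000):
    """Approximate worst-case distance from random test points in [0,1]^d
    to the nearest point of an evenly spaced m^d training grid."""
    axes = [np.linspace(0, 1, m)] * d
    train = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, d)
    test = rng.uniform(size=(n_test, d))
    dist, _ = cKDTree(train).query(test)
    return train.shape[0], dist.max()

for d, ms in [(1, [10, 40, 160, 640]), (2, [4, 8, 16, 32])]:
    ns, dists = zip(*(max_covering_distance(m, d) for m in ms))
    slope = np.polyfit(np.log(ns), np.log(dists), 1)[0]
    print(f"d={d}: fitted exponent {slope:.2f}  (predicted -1/d = {-1/d:.2f})")
```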
If you now put these two facts together, you get that the generalization gap (the gap between true loss and training loss) scales like $n^{-1/d}$, where $n$ is the number of training data samples and $d$ is the Hausdorff dimension of the latent space. In other words, we have a concrete explanation of data scaling laws!
It's worth noting that this analysis is independent of architecture, task, loss function (mostly) and doesn't even assume that you're using a neural network! It applies to all statistical learning methods. So that's pretty cool!
The second major phenomenon we can explain is double descent, and the explanation in fact utilizes an existing framework. Double descent is the phenomenon where, as the number of parameters per data sample increases, eval loss first decreases, then increases as classical learning theory predicts, but then decreases again! That last part has been quite the mystery in modern ML.
We propose an explanation. The generalization gap has long been known to be bounded by a term depending on the complexity of the model. For small models, increasing parameters helps fit the data better, driving down both training and eval loss. Eventually you start to overfit and the complexity skyrockets, causing eval loss to rise. However, as you continue increasing parameters, the space of possible models keeps expanding until it contains models which both fit the data well and have low complexity! This drives eval loss down again, provided you can find these models. It fits with the empirical observations that enormous models are simple (have low-rank weight matrices), that sparse subnetworks can do just as well as the full model, and that abnormally important "superweights" exist.
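If you want a cheap toy to see this with (not our framework, just the classic random-features setup where double descent is known to appear): minimum-norm least squares on random ReLU features typically shows test error dipping, spiking near the interpolation threshold (features ≈ training samples), then falling again as width keeps growing. The minimum-norm solution is exactly the kind of low-complexity interpolator the story above appeals to in the overparameterized regime.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 2000, 5
w_true = rng.normal(size=d)  # fixed "teacher" direction

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.sin(X @ w_true) + 0.1 * rng.normal(size=n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def relu_features(X, W):
    return np.maximum(X @ W, 0.0)

# Sweep the number of random features past the interpolation threshold
# (n_features == n_train); test error typically falls, spikes near the
# threshold, then falls again as width keeps growing.
for n_features in [5, 10, 20, 35, 40, 45, 60, 100, 200, 800]:
    errs = []
    for _ in range(10):  # average over random feature draws
        W = rng.normal(size=(d, n_features)) / np.sqrt(d)
        Phi_tr, Phi_te = relu_features(X_tr, W), relu_features(X_te, W)
        # lstsq returns the minimum-norm solution when underdetermined,
        # i.e. a low-complexity interpolator in the overparameterized regime.
        coef, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
        errs.append(np.mean((Phi_te @ coef - y_te) ** 2))
    print(f"features={n_features:4d}  test MSE={np.mean(errs):.3f}")
```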
So yeah, we provide plausible explanations for two major mysteries of machine learning. I think that's really cool! Unfortunately I don't really have the following or institutional affiliation to get this widely noticed. I'd love your feedback! And if you think it's cool too, I'd really appreciate you sharing it with other people, retweeting the thread, and offering to endorse me so I can publish this on arXiv!
Thanks for your time!