If you disagree with something I write, I want to hear it! I often find myself posting mainly because I want the best available counterarguments.

Understanding “Deep Double Descent”

I took a look at the Colab notebook linked from that blog post, and there are some subtleties the blog post doesn't discuss. First, the blog post says they're using gradient descent, but if you look at the notebook, they're actually computing a pseudoinverse. [EDIT: They state a justification in their notebook which I missed at first. See discussion below.] Second, the blog post talks about polynomial regression, but doesn't mention that the notebook uses Legendre polynomials. I'm pretty sure high degree Legendre polynomials look like a squiggle which closely follows the x-axis on [-1, 1] (the interval used in their demo). If you fork the notebook and switch from Legendre polynomials to classic polynomial regression (1, x, x^2, x^3, etc.), the degree 100,000 fit appears to be worse than the degree 20 fit. I searched on Google and Google Scholar, and use of Legendre polynomials doesn't seem to be common practice in ML.
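For anyone who wants to try the basis comparison without forking the notebook, here is a rough sketch of how I'd set it up. It is not the notebook's code: the target function (a sine wave), the noise level, the sample size, and the names `min_norm_fit` and `compare_bases` are all my own assumptions for illustration. It fits the minimum-norm solution with a pseudoinverse in both the Legendre basis and the plain monomial basis (1, x, x^2, ...) on [-1, 1] and reports test error for each.

```python
import numpy as np
from numpy.polynomial import legendre, polynomial


def min_norm_fit(features, targets):
    # Minimum-norm least-squares weights via the Moore-Penrose pseudoinverse.
    return np.linalg.pinv(features) @ targets


def compare_bases(degree, n_train=20, seed=0):
    """Return test MSE of the min-norm fit in each basis (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    # Assumed toy data: noisy samples of sin(pi * x) on [-1, 1].
    x_train = rng.uniform(-1, 1, n_train)
    y_train = np.sin(np.pi * x_train) + 0.1 * rng.normal(size=n_train)
    x_test = np.linspace(-1, 1, 200)
    y_test = np.sin(np.pi * x_test)

    bases = {
        "legendre": legendre.legvander,     # columns P_0(x), ..., P_degree(x)
        "monomial": polynomial.polyvander,  # columns 1, x, ..., x^degree
    }
    errors = {}
    for name, vander in bases.items():
        weights = min_norm_fit(vander(x_train, degree), y_train)
        predictions = vander(x_test, degree) @ weights
        errors[name] = float(np.mean((predictions - y_test) ** 2))
    return errors


if __name__ == "__main__":
    for degree in (5, 20, 200):
        print(degree, compare_bases(degree))
```

The only thing that changes between the two runs is the feature map, so any difference in test error comes from which basis the minimum-norm solution is measured in. I haven't included expected numbers because they depend on the seed and the toy target.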

Understanding “Deep Double Descent”

The decision tree result seemed counterintuitive to me, so I took a look at that section of the paper. I wasn't impressed. In order to create a double descent curve for decision trees, they change their notion of "complexity" (midway through the graph... check Figure 5) from the number of leaves in a tree to the number of trees in a forest. Turns out that right after they change their notion of "complexity" from number of leaves to number of trees, generalization starts to improve :)

I don't see this as evidence for double descent per se. Just that ensembling improves generalization. Which is something we've known for a long time. (And this fact doesn't seem like a big mystery to me. From a Bayesian perspective, ensembling is like using the posterior predictive distribution instead of the MAP estimate. BTW I think there's also a Bayesian story for why flat minima generalize better -- the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occam's Razor works...)

Anyway, the authors' rationale seems to be that once your decision tree has memorized the training set, the only way to increase the complexity of your hypothesis class is by adding trees to the forest. I'd rather they had kept the number of trees constant and only modulated the number of leaves.

However, the decision tree discussion does point to a possible explanation of the double descent phenomenon for neural networks. Maybe once you've got enough complexity to memorize the training set, adding more complexity allows for a kind of "implicit ensembling" which leads to memorizing the training set in many different ways and averaging the results together like an ensemble does.

It's suspicious to me that every neural network case study in the paper modulates layer width. There's no discussion of modulating depth. My guess is they tried modulating depth but didn't get the double descent phenomenon and decided to leave those experiments out.

I think increased layer width fits pretty nicely with my implicit ensembling story. Taking a Bayesian perspective on the output neuron: After there are enough neurons to memorize the training set, adding more leads to more pieces of evidence regarding the final output, making estimates more robust. Which is more or less why ensembles work IMO.

Gears-Level Models are Capital Investments

offer outside-the-box insights

I don't think that's the same as "thinking outside the box you're given". That's about power of extrapolation, which is a separate entangled dimension.

Anyway, suppose I'm thinking of a criterion. Of the integers 1-20, the ones which meet my criterion are 2, 3, 5, 7, 11, 13, 17, 19. I challenge you to write a program that determines whether a number meets my criterion or not. A "black-box" program might check to see if the number is on the list I gave. A "gears-level" program might check to see if the number is divisible by any integer besides itself and 1. The "gears-level" program is "within the box" in the sense that it is a program which returns True or False depending on whether my criterion is supposedly met--the same box the "black-box" program is in. And in principle it doesn't have to be constructed using prior knowledge. Maybe you could find it by brute forcing all short programs and returning the shortest one which matches available data with minimal hardcoded integers, or some other method for searching program space.
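To make the two programs concrete, here is a minimal sketch. The function names `black_box` and `gears_level` are just labels I picked. Note one assumption I had to add: the divisibility check alone would accept 1, which is not on my list, so the gears-level version rejects numbers below 2 explicitly.

```python
import math

# The only data given: which of the integers 1-20 meet the criterion.
KNOWN_MATCHES = {2, 3, 5, 7, 11, 13, 17, 19}


def black_box(n: int) -> bool:
    # Pure lookup: says nothing useful about numbers outside the observed range.
    return n in KNOWN_MATCHES


def gears_level(n: int) -> bool:
    # Checks whether n has any divisor besides itself and 1.
    if n < 2:
        return False  # added assumption: 1 is not on the list
    return all(n % d != 0 for d in range(2, math.isqrt(n) + 1))


if __name__ == "__main__":
    # Both programs sit in the same "box": integer in, True/False out.
    print(all(black_box(n) == gears_level(n) for n in range(1, 21)))
    # They only come apart when extrapolating beyond the data.
    print(black_box(23), gears_level(23))
```

On 1-20 the two are indistinguishable from the outside. The difference only shows up on inputs outside the data, which is the extrapolation dimension I was trying to separate out.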

Similarly, a being from another dimension could be transported to our dimension, observe some physical objects, try to make predictions about them, and deduce that F=ma. They aren't using prior knowledge since their dimension works differently than ours. And they aren't "thinking outside the box they're given", they're trying to make accurate predictions, just as one could do with a black box model.

Gears-Level Models are Capital Investments

Of gears-level models that don't make use of prior knowledge or entangled dimensions?

Ultra-simplified research agenda

Theory of mind is something that humans have instinctively and subconsciously, but that isn't easy to spell out explicitly; therefore, by Moravec's paradox, it will be very hard to implant it into an AI, and this needs to be done deliberately.

I think this is the weakest part. Consider: "Recognizing cat pictures is something humans can do instinctively and subconsciously, but that isn't easy to spell out explicitly; therefore, by Moravec's paradox, it will be very hard to implant it into an AI, and this needs to be done deliberately." But in practice, the techniques that work best for cat pictures work well for lots of other things as well, and a hardcoded solution customized for cat pictures will actually tend to underperform.

Gears-Level Models are Capital Investments

I can imagine modeling strategies which feel relatively "gears-level" yet don't make use of prior knowledge or "think outside the box they're given". I think there are a few entangled dimensions here which could be disentangled in principle.

Gears-Level Models are Capital Investments

Gears-level insights can highlight ideas we wouldn't even have thought to try, whereas black-box just tests the things we think to test... it can't find unknown unknowns

It seems to me that black-box methods can also highlight things we wouldn't have thought to try, e.g. genetic algorithms can be pretty creative.

The LessWrong 2018 Review

Is there some way I can see all the posts I upvoted in 2018 so I can figure out which I think are worthy of nomination?

Compiling the results into a physical book. I find there's something... literally weighty about having your work in printed form. And because it's much harder to edit books than blogposts, the printing gives authors an extra incentive to clean up their past work or improve the pedagogy.

Physical books are also often read in a different mental mode, with a longer attention span, etc. You could also sell it as a Kindle book to get the same effect. Smashwords is a service that lets you upload a book once and sell it on many different platforms.

The end of the review process includes a straightforward vote on which posts seem (in retrospect) useful, and which seem "epistemically sound". This is not the end of the conversation about which posts are making true claims that carve reality at its joints, but my hope is for it to ground that discussion in a clearer group-epistemic state.

Is the idea to only include in the review those posts which are almost universally regarded as "epistemically sound"?

Thanks Preetum. You're right, I missed that note the first time -- I edited my comment a bit.

It might be illuminating to say "the polynomial found by iterating weights starting at 0" instead of "the polynomial found with gradient descent", since in this case, the inductive bias comes from the initialization point, not necessarily gradient descent per se. Neural nets can't learn if all the weights are initialized to 0 at the start, of course :)
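Here is a small sketch of the initialization point, under my own assumptions: an underdetermined linear least-squares problem with random Gaussian data (5 examples, 20 features), plain gradient descent on the unregularized squared error, and a hypothetical helper `gradient_descent`. This is not the notebook's code. Starting from zero, gradient descent should land on the same minimum-norm weights the pseudoinverse gives. Starting anywhere else, it still fits the training data but ends up at a different solution.

```python
import numpy as np


def gradient_descent(X, y, w_init, steps=5000):
    # Plain gradient descent on 0.5 * ||X w - y||^2.
    # Step size 1 / sigma_max(X)^2 keeps the iteration stable.
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    w = w_init.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))  # fewer examples than features
y = rng.normal(size=5)

w_pinv = np.linalg.pinv(X) @ y                        # minimum-norm interpolant
w_zero = gradient_descent(X, y, np.zeros(20))         # start at the origin
w_rand = gradient_descent(X, y, rng.normal(size=20))  # start somewhere else

if __name__ == "__main__":
    print("zero init matches pinv:", np.allclose(w_zero, w_pinv, atol=1e-6))
    print("random init fits data: ", np.allclose(X @ w_rand, y, atol=1e-6))
    print("random init matches pinv:", np.allclose(w_rand, w_pinv, atol=1e-3))
```

The reason is that every gradient step moves the weights within the row space of X, so whatever component the starting point has outside that space never changes. Starting at zero means that component is zero, which is exactly the minimum-norm solution.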

BTW, I tried switching from pseudoinverse to regularized linear regression, and the super high degree polynomials seemed more overfit to me.