Mikewins
Mikewins has not written any posts yet.

Honestly, for me it's more of a strike against RNNs. Real deep neural networks that have been trained don't have this property, so it's a bridge we're going to need to cross at some point regardless. From a derisking point of view, I'd kind of like to get to that point ASAP. There's a lot of talk about looking at random Boolean circuits (which very obviously don't have this property), narrow MLPs, or even jumping all the way to wide MLPs trained in some sort of mean-field/maximum update regime that gets rid of it.
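For concreteness, here is a minimal sketch of the parametrization contrast usually meant by "mean-field/maximum update" (this is the standard textbook contrast, not anything specific to the discussion above): the readout layer is scaled by 1/N rather than 1/sqrt(N), and with learning rates rescaled accordingly the hidden features move by O(1) during training even as the width N grows, which is the usual mechanism by which wide networks escape the lazy/linearized regime.

```latex
% NTK / standard parametrization: readout scaled by 1/sqrt(N);
% as N grows, individual features barely move during training (lazy regime).
f_{\mathrm{NTK}}(x) \;=\; \frac{1}{\sqrt{N}} \sum_{i=1}^{N} a_i\, \phi(w_i \cdot x)

% Mean-field / maximal-update parametrization: readout scaled by 1/N,
% with learning rates rescaled so features move by O(1) even as N grows.
f_{\mathrm{MF}}(x) \;=\; \frac{1}{N} \sum_{i=1}^{N} a_i\, \phi(w_i \cdot x)
```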
I am affiliated with ARC and played a major role in the MLP stuff
I'm loosely familiar with Greg Yang's work, and very familiar with the 'Neural Network Gaussian Process' canon. It's definitely relevant, especially as an intuition pump, but it tends to answer a different question: 'what is the distribution of quantities x, y, and z over the set of all NNs?', where x, y, and z might be some preactivations on specific inputs. Knowing that they are jointly Gaussian with such-and-such covariance has been a powerful intuition pump for us. But what we mainly want is an algorithm that takes in a specific NN with specific weights...
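To make the "jointly Gaussian with such-and-such covariance" intuition concrete, here is a minimal Monte Carlo sketch (a toy illustration of the NNGP claim, not ARC's code; the dimensions, variances, and inputs are arbitrary choices of mine) checking a single first-layer preactivation on two inputs against the theoretical covariance:

```python
# Toy check: preactivations of a randomly initialized MLP layer are jointly
# Gaussian with covariance K(x, x') = sigma_w^2 * <x, x'> / d + sigma_b^2.
import numpy as np

rng = np.random.default_rng(0)
d, sigma_w, sigma_b, n_samples = 64, 1.0, 0.5, 100_000

# Two fixed inputs; we look at the preactivation of a single unit
# z(x) = w . x + b, with w ~ N(0, sigma_w^2 / d * I) and b ~ N(0, sigma_b^2).
x1 = rng.normal(size=d)
x2 = rng.normal(size=d)

W = rng.normal(scale=sigma_w / np.sqrt(d), size=(n_samples, d))
b = rng.normal(scale=sigma_b, size=n_samples)
z1, z2 = W @ x1 + b, W @ x2 + b

# Theoretical NNGP covariance vs. empirical covariance over random inits.
K_theory = sigma_w**2 * np.array([[x1 @ x1, x1 @ x2],
                                  [x1 @ x2, x2 @ x2]]) / d + sigma_b**2
K_empirical = np.cov(np.stack([z1, z2]))

print(np.round(K_theory, 3))
print(np.round(K_empirical, 3))  # should match up to Monte Carlo noise
```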
I am affiliated with ARC and played a major role in the MLP stuff
The particular infinite sum discussed in this post is used for approximating MLPs with just one hidden layer, so things like vanishing gradients can't matter.
We are now doing work on deeper MLPs. In this case, the vanishing-gradients story definitely does seem relevant. We don't fully understand every detail, but I'll mouth off anyway.
On one hand, there are hyperparameter choices where gradients explode. It turns out that in this regime, matching sampling: exponentially large gradients mean that the average is a sum over exponentially many tiny and more-or-less independent regions, so it's exponentially small and you can beat...
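As a toy illustration of the vanishing/exploding split (my own sketch, not the analysis from the post): the input-output Jacobian of a deep random ReLU MLP shrinks or grows exponentially with depth depending on the weight-init variance, with sigma_w^2 = 2 being the critical choice for ReLU.

```python
# Toy demo of vanishing vs. exploding gradients in a deep random ReLU MLP:
# track the norm of the input-output Jacobian as a function of init scale.
import numpy as np

def grad_norm_through_depth(depth, width, sigma_w, rng):
    """Frobenius norm of d(output)/d(input) for a deep ReLU MLP at random init."""
    x = rng.normal(size=width)
    J = np.eye(width)  # accumulated input-output Jacobian
    for _ in range(depth):
        W = rng.normal(scale=sigma_w / np.sqrt(width), size=(width, width))
        pre = W @ x
        D = np.diag((pre > 0).astype(float))  # ReLU derivative
        J = D @ W @ J
        x = np.maximum(pre, 0.0)
    return np.linalg.norm(J)

rng = np.random.default_rng(0)
for sigma_w in (1.0, np.sqrt(2.0), 2.0):   # vanishing / critical / exploding
    norms = [grad_norm_through_depth(depth=50, width=256, sigma_w=sigma_w, rng=rng)
             for _ in range(5)]
    print(f"sigma_w = {sigma_w:.3f}:  ||J|| ~ {np.mean(norms):.3e}")
```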
Does the 1/sqrt(N) error for SGD assume a single pass? It seems like if we're bottlenecked on the number of data points, we can use multiple passes and do nearly as well as the Bayesian approach (at least for half-spaces).
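For what it's worth, assuming the 1/sqrt(N) here is the standard excess-risk guarantee for single-pass SGD on convex, Lipschitz losses (a guess on my part, not something stated above), the usual form is:

```latex
% Single-pass SGD on a convex, G-Lipschitz loss with comparator norm at most B,
% using the averaged iterate \bar{w}_N after one pass over N i.i.d. samples:
\mathbb{E}\!\left[ L(\bar{w}_N) \right] \;-\; \min_{\|w\| \le B} L(w)
  \;\le\; O\!\left( \frac{B\,G}{\sqrt{N}} \right)
```

This particular guarantee is derived for a single pass; I don't know which analysis the post has in mind.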