Zachary Robertson

Comments

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

You seem to have updated your opinion: overtraining does make a difference, but it’s not ‘huge’. Have you run a significance test for your lines of best fit? The plots as presented suggest the effect is significant.
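
For concreteness, here is the kind of significance test I have in mind, as a rough sketch: fit the two lines of best fit and test whether their slopes differ. The arrays below are made-up placeholders, not the paper's data.

```python
import numpy as np
from scipy import stats

# Rough sketch only: the arrays below are made-up placeholders standing in for
# the paper's (log-probability) data, which I don't have.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-12, -1, size=40))              # e.g. log P(f) under random sampling
y_plain = 1.00 * x + rng.normal(0, 0.3, size=40)        # log P(f) under SGD, no overtraining
y_over = 0.90 * x - 0.5 + rng.normal(0, 0.3, size=40)   # log P(f) under SGD with overtraining

fit_plain = stats.linregress(x, y_plain)
fit_over = stats.linregress(x, y_over)

# Two-sided z-test on the difference of slopes (treats the two fits as independent).
z = (fit_plain.slope - fit_over.slope) / np.hypot(fit_plain.stderr, fit_over.stderr)
p_value = 2 * stats.norm.sf(abs(z))
print(f"slopes: {fit_plain.slope:.3f} vs {fit_over.slope:.3f}, p = {p_value:.3g}")
```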

Figure C.1.a indicates the tilting phenomenon. Probabilities only go up to one, so tilting down means that the most likely candidates from overtrained SGD are less likely with random sampling. Thus, candidates that are unlikely under random sampling are more likely under SGD. At the tail, the opposite happens: functions that are more likely with random sampling become less likely under SGD.

While the optimizer has a larger effect, I think the subtle question is whether the overtraining tilts in the same way each time. Figure 16 again indicates yes. This phenomenon, which you consider minor, is what I found most interesting about the paper.

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume.

What I'm suggesting is that volume in high-dimensions can concentrate on the boundary. To be clear, when I say SGD only typically reaches the boundary, I'm talking about early stopping and the main experimental setup in your paper where training is stopped upon reaching zero train error.

We have done overtraining, which should allow SGD to penetrate into the region. This doesn’t seem to make much difference for the probabilities we get.

This does seem to invalidate the model. However, something tells me that the difference here is more about degree. Since you use the word 'should' I'll use the wiggle room to propose an argument for what 'should' happen.

If SGD is run with early stopping, as described above, then my argument is that this is roughly equivalent to random sampling, via an appeal to concentration of measure in high dimensions.
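
As a minimal numeric sketch of the concentration intuition behind that appeal (treating the zero-train-error region as if it were a high-dimensional ball, which is only an analogy, not a claim about the actual geometry):

```python
# Fraction of a unit n-ball's volume lying within a thin shell of relative
# thickness eps at the boundary: 1 - (1 - eps)**n, which tends to 1 as n grows.
eps = 0.01
for n in [10, 100, 1_000, 10_000]:
    shell_fraction = 1 - (1 - eps) ** n
    print(f"n = {n:>6}: volume fraction within {eps:.0%} of the boundary = {shell_fraction:.4f}")
```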

If SGD is not run with early stopping, it moves past the boundary into the interior of the zero-train-error region. Because most of the volume concentrates near the boundary rather than the interior, these interior functions are unlikely to be produced by random sampling. Thus, on a log-log plot I'd expect overtraining to 'tilt' the correspondence between SGD and random-sampling likelihoods downward.

Falsifiable Hypothesis: Compare SGD with overtraining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling will become less likely under SGD with overtraining.
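
One way I'd operationalize the check, sketched with hypothetical per-function probabilities (`p_rand` and `p_sgd` are placeholders for measured frequencies over the same set of zero-train-error functions; I don't have real data):

```python
import numpy as np

# Hedged sketch: p_rand and p_sgd stand in for per-function probabilities under
# random sampling vs. overtrained SGD; the arrays below are placeholders only.
rng = np.random.default_rng(1)
p_rand = rng.dirichlet(np.ones(200))
p_sgd = rng.dirichlet(np.ones(200))

order = np.argsort(p_rand)
tail, head = order[:20], order[-20:]   # least / most likely under random sampling

print("head (likely under random sampling), mean log(p_sgd / p_rand):",
      np.log(p_sgd[head] / p_rand[head]).mean())
print("tail (unlikely under random sampling), mean log(p_sgd / p_rand):",
      np.log(p_sgd[tail] / p_rand[tail]).mean())
# The hypothesis predicts a negative mean for the head and a positive mean for the tail.
```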

Recognizing Numbers

A problem I'm finding with this formulation is that it moves the problem to something that is arguably harder. We've replaced the problem of recognizing numbers with the problem of recognizing sets. The main post does this as well. There's nothing technically wrong with this, but it raises an immediate question: how do we know when sets are useful? And if a similar logic applies, how do we create an abstraction of a set from observation(s)? George Cantor, one of the founders of set theory, writes:

A set is a gathering together into a whole of definite, distinct objects of our perception [Anschauung] or of our thought—which are called elements of the set.

To gather distinct perceptions together requires unity of apperception or a single 'I think' to be attached to each perception so that they may be brought under a category/set/etc.

What is going on in the world?

I like this one because you can generate more using GPT-3 (though that doesn’t imply they make sense).

The Good Try Rule

I think this post does a good job of motivating a definition for “good try”. It also seems possible to think of habit changes as examples of goals. I personally find the SMART goal system to be useful and related to the discussion. SMART goals should be Specific, Measurable, Attainable, Reasonable, Timely. The approach is to specify why the habit change goal meets each of the SMART criteria.

I mention this because giving something a “good try” seems similar enough to attempting habit change with a SMART goal. This makes it clearer (at least for me) that what we’re talking about is making some sort of prediction about how a successful habit change will proceed and then testing that prediction by attempting the habit change according to the plan. I think this also opens up the opportunity for giving something multiple “good tries” before evaluating success/failure.

Minimal Maps, Semi-Decisions, and Neural Representations

I'm going to have to spend some time unpacking the very compact notation in the post, but here are my initial reactions.

I should apologize a bit for that. To a degree I wasn't really thinking about any of the concepts in the title and only saw the connection later.

First, very clean proof of the lemma, well done there.

Thanks!

Second... if I'm understanding this correctly, each neuron activation (or set of neuron activations?) would contain all the information from some-part-of-data relevant to some-other-part-of-data and the output.

To be honest, I haven't thought about interpreting the monad beyond the equivalence with neural networks. One thing I noticed early on is that you can create sequences of activations that delete information in the limit. For example, the ReLU activation is the limit of the SoftMax (change log base). I think something like this could be seen as abstracting away unnecessary data.
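
Concretely, the limit I'm alluding to with "change log base" can be written as follows, treating the softmax here as the base-$b$ smooth maximum of $0$ and $x$:

```latex
% ReLU as the large-base limit of the smooth maximum of 0 and x:
\[
  \log_b\!\bigl(b^{0} + b^{x}\bigr)
  \;=\; \frac{\ln\!\bigl(1 + e^{x\ln b}\bigr)}{\ln b}
  \;\longrightarrow\; \max(0, x) \;=\; \operatorname{ReLU}(x)
  \qquad \text{as } b \to \infty .
\]
```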

Better yet, it looks like the OP gives a recipe for unpacking those natural abstractions?

I'm not sure. I do think the method can justify the reuse of components (queries), and I wouldn't be surprised if this is a prerequisite for interpreting network outputs. Most of my interest comes from trying to formalize the (perhaps obvious) idea that anything that can be reduced to a sequence of classifications can be used to systematically translate high-level reasoning about these processes into a neural network.

I guess it's best to give an example of how I currently think about abstraction. Say we take the position that every object is completely determined by the information contained in a set of queries. For a picture, consider designing a game avatar (Mii character) by fiddling around with some knobs. The formalism lets us package observations as queries using return. Thus, we're hypothesizing that we can take a large collection of queries and make them equivalent to a small set of queries. Said another way, we can answer a large collection of queries by answering a much smaller set of 'principal' queries. In fact, if our activation were linear we'd be doing PCA. How we decide to measure success determines what abstraction is learned. If we only use the principal queries to answer a few queries, then we're basically doing classification. However, if they have to be able to answer every query about the object, then we're doing auto-encoding.
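
To make the PCA picture concrete, here is a minimal sketch, assuming linear queries and a linear 'activation' so that the principal queries are literally principal components. All names and dimensions below are invented for illustration and are not taken from the post.

```python
import numpy as np

# Illustrative sketch only: treat each linear functional of an object x as a
# "query", and check that a large collection of queries can be answered well
# from a handful of "principal" queries (PCA).
rng = np.random.default_rng(0)

# Objects: 500 samples living near a 5-dimensional subspace of R^50.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 50))

# A large collection of queries: 200 random linear functionals of each object.
queries = rng.normal(size=(50, 200))
answers = X @ queries

# A small set of principal queries: project onto the top-5 principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
principal = Xc @ Vt[:5].T                      # answers to 5 "principal" queries

# Answer all 200 original queries from the 5 principal answers (least squares).
W, *_ = np.linalg.lstsq(principal, answers - answers.mean(axis=0), rcond=None)
reconstructed = principal @ W + answers.mean(axis=0)

rel_err = np.linalg.norm(answers - reconstructed) / np.linalg.norm(answers)
print(f"relative error answering 200 queries from 5 principal queries: {rel_err:.3f}")
```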

Doing discourse better: Stuff I wish I knew

Yes, but StackExchange has community posts that are editable, and I think this is nice. I believe edits for normal posts work like you say.

Doing discourse better: Stuff I wish I knew

It could still be useful to see different ‘versions’ of an article and then just vote on the ones that are best.

Richard Ngo's Shortform

Ya, I totally messed that up. I meant the AI Alignment Forum, or AIAF. I think out of habit I used AN (Alignment Newsletter).

Richard Ngo's Shortform

On AI alone (which I am using in large part because there's vaguely more consensus around it than around rationality), I think you wouldn't have seen almost any of the public write-ups (like Embedded Agency and Zhukeepa's Paul FAQ) without LessWrong

I think a distinction should be made between intellectual progress (whatever that is) and distillation. I know lots of websites that do amazing distillation of AI-related concepts (literally distill.pub). I think most people would agree that sort of work is important in order to make intellectual progress, but I also think significantly fewer people would agree that distillation is intellectual progress. Having this distinction in mind, I think your examples from AI are not as convincing. Perhaps more so once you consider that LessWrong is often being used more as a platform to share these distillations than to create them.

I think you're right that Less Wrong has some truly amazing content. However, once again, it seems a lot of these posts are not inherently from the ecosystem but are rather essentially cross-posted. If I say a lot of the content on LW is low-quality it's mostly an observation about what I expect to find from material that builds on itself. The quality of LW-style accumulated knowledge seems lower than it could be.

On a personal note, I've actively tried to explore using this site as a way to engage with research and have come to a similar opinion as Richard. The most obvious barrier is the separation between LW and AIAF. Effectively, if you're doing AI safety research, then to a second-order approximation you can block LW (noise) and only look at AIAF (signal). I say second-order because anything from LW that is signal ends up being posted on AIAF anyway, which means the method is somewhat error-tolerant.

This probably comes off as a bit pessimistic. Here's a concrete proposal I hope to try out soon enough. Pick a research question. Get a small group of people/friends together. Start talking about the problem and then posting on LW. Iterate until there's group consensus.
