TL;DR: The invention of Convolutional Neural Networks (CNNs) for image and audio processing was a key landmark in machine learning.

This post is for people who already know what CNNs are and are interested in how to riff on and extend what is (perhaps?) the core reason that CNNs learn faster. Probing the technology is one 'sub goal' in questioning where our AI knowledge is heading, and how fast. We ask that question because we want AI to progress in a good direction.

Sub Goal

Q: Can the reduction in number of parameters that a CNN introduces be achieved in a more general way?

A: Yes. Here are sketches of two ways:

1) Saccades. Train one network (layer) on attention. Train it to learn which local blocks of the image to give attention to. Train the second part of the network using those chosen 'local blocks' in conjunction with coordinates of their locations.

The number of blocks that have large CNN kernels applied to them is much reduced. Those blocks are the blocks that matter.
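To make the 'local blocks plus coordinates' idea a little more concrete, here is a minimal sketch in plain Python. It is not a trained network: `attention_score` is a hypothetical stand-in (pixel variance) for whatever a learned attention layer would output, and the function names and block size are assumptions for illustration only.

```python
from statistics import pvariance

def split_blocks(image, size):
    """Yield (row, col, block) for non-overlapping size x size blocks."""
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            block = [row[c:c + size] for row in image[r:r + size]]
            yield r, c, block

def attention_score(block):
    # Hypothetical stand-in for a learned attention layer: pixel variance.
    return pvariance([p for row in block for p in row])

def select_glimpses(image, size, k):
    """Return the k highest-scoring blocks together with their coordinates."""
    scored = sorted(split_blocks(image, size),
                    key=lambda item: attention_score(item[2]),
                    reverse=True)
    return scored[:k]

# 8x8 image: flat background with a checkerboard patch in the lower right.
image = [[0.0] * 8 for _ in range(8)]
for r in range(4, 8):
    for c in range(4, 8):
        image[r][c] = float((r + c) % 2)

glimpses = select_glimpses(image, size=4, k=1)
for row, col, block in glimpses:
    print("attend to block at", (row, col))
```

Only the selected blocks, together with their `(row, col)` coordinates, would be passed on to the second part of the network.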

2) Parameter Compression. Give each layer of a neural network more (potential) connections than you think will actually end up being used. After training for a few cycles, compress the parameter values using a lossy algorithm, always choosing the compression which scores best on some weighting of size and quality. Uncompress and repeat this process till you have completed the training set.

This keeps the number of bits used to represent the parameters low, which helps guard against overfitting.
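The train, compress, uncompress loop can be sketched on a toy problem. Everything here is an assumption made for illustration: the 'network' is a single linear layer, the lossy compressor is uniform quantization at a few candidate bit widths, and the score is mean squared error plus a penalty per bit (`size_penalty` is a made-up weighting).

```python
import random

def quantize(weights, bits):
    """Lossy compression: snap each weight to one of 2**bits evenly spaced levels."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]

def predict(weights, xs):
    return sum(w * x for w, x in zip(weights, xs))

def mse(weights, data):
    return sum((predict(weights, xs) - y) ** 2 for xs, y in data) / len(data)

def sgd_epoch(weights, data, lr):
    """One pass of plain stochastic gradient descent over the data."""
    for xs, y in data:
        err = predict(weights, xs) - y
        weights = [w - lr * err * x for w, x in zip(weights, xs)]
    return weights

def best_compression(weights, data, size_penalty, bit_options=(2, 4, 8)):
    """Pick the candidate with the lowest (loss + size_penalty * total bits)."""
    candidates = [(bits, quantize(weights, bits)) for bits in bit_options]
    return min(candidates,
               key=lambda c: mse(c[1], data) + size_penalty * c[0] * len(weights))

random.seed(0)
true_w = [1.0, 0.0, -1.0, 0.0]
data = []
for _ in range(200):
    xs = [random.uniform(-1, 1) for _ in true_w]
    data.append((xs, predict(true_w, xs)))

weights = [random.uniform(-0.1, 0.1) for _ in true_w]
for cycle in range(10):
    for _ in range(3):                      # train for a few cycles
        weights = sgd_epoch(weights, data, lr=0.1)
    # compress, then carry on training from the decompressed values
    bits, weights = best_compression(weights, data, size_penalty=1e-3)

print("bits per weight:", bits, "loss:", mse(weights, data))
```

Note that the choice between candidate compressions is a selection step, not a gradient step.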


[Doubter] This all sounds very hand wavy. How exactly would you train a saccadic network on the right movements?

[Optimist] One stepping stone, before you get to a true saccadic network with the locus of attention following a temporal trajectory, is to train a shallow network to classify where to give attention. So this stepping stone outputs a weighting for how much attention to give to each location. For the sake of being more concrete, it works on a downsampled image and gives 0 for no attention, 1 for convolution with a 3x3 kernel, 2 for convolution with a 5x5 kernel.
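As a rough sketch of how such a 0/1/2 attention map might be consumed (the map itself is hand-written here, not learned, and a box filter stands in for a learned convolution kernel; all function names are assumptions):

```python
KERNEL_SIZE = {0: None, 1: 3, 2: 5}  # attention level -> kernel width

def box_filter_at(image, r, c, k):
    """Average over a k x k window centred on (r, c), clipped at the borders.
    A box filter is a stand-in for a learned convolution kernel."""
    half = k // 2
    rows = range(max(0, r - half), min(len(image), r + half + 1))
    cols = range(max(0, c - half), min(len(image[0]), c + half + 1))
    vals = [image[i][j] for i in rows for j in cols]
    return sum(vals) / len(vals)

def attended_conv(image, attention, factor):
    """Apply a kernel only where the downsampled attention map asks for one.
    Returns the output and a nominal multiply count (k*k per attended pixel)."""
    out = [[0.0] * len(image[0]) for _ in image]
    mults = 0
    for r in range(len(image)):
        for c in range(len(image[0])):
            k = KERNEL_SIZE[attention[r // factor][c // factor]]
            if k is None:
                continue  # no attention: skip this location entirely
            out[r][c] = box_filter_at(image, r, c, k)
            mults += k * k
    return out, mults

image = [[1.0] * 8 for _ in range(8)]
attention = [[0, 1],
             [2, 0]]  # 2x2 map for an 8x8 image, so factor = 4
out, mults = attended_conv(image, attention, factor=4)
dense_mults = 8 * 8 * 5 * 5  # a 5x5 kernel applied everywhere
print(mults, "multiplies instead of", dense_mults)
```

The multiply count says nothing about wall-clock time on real hardware, where the irregular access pattern may cost more than it saves.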

[Doubter] You still haven't said how you would do that attention training.

[Optimist] You could reward a network for robustness to corruption of the image. Reward it for zeroes in the attention layers.
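One possible reading of that reward, sketched under stated assumptions: the task loss is measured on a corrupted copy of the image, and an L1 penalty pushes attention values toward zero. The corruption (random pixel dropout), the penalty form and `sparsity_weight` are all illustrative choices, not something the dialogue specifies.

```python
import random

def corrupt(image, drop_prob, rng):
    """Randomly zero out pixels; one of many possible corruptions."""
    return [[0.0 if rng.random() < drop_prob else p for p in row]
            for row in image]

def attention_training_loss(task_loss_on_corrupted, attention_map,
                            sparsity_weight=0.01):
    """Task loss measured on a corrupted input, plus an L1 penalty that
    rewards zeroes in the attention map."""
    l1 = sum(abs(a) for row in attention_map for a in row)
    return task_loss_on_corrupted + sparsity_weight * l1

noisy = corrupt([[1.0, 2.0], [3.0, 4.0]], drop_prob=0.5, rng=random.Random(0))

sparse_map = [[0, 0], [0, 2]]
dense_map = [[2, 2], [2, 2]]
# With equal task loss, the sparser attention map gets the lower loss.
sparse_loss = attention_training_loss(0.30, sparse_map)
dense_loss = attention_training_loss(0.30, dense_map)
print(sparse_loss < dense_loss)
```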

[Doubter] That's not clear, and I think there is a Catch 22. You need to have analysed the image to decide where to give it attention.

[Optimist] ...but not analyse in full detail. Use only a few downsampled layers to decide where to give attention. You save a ton of CPU by only giving more attention where it is needed.

[Doubter] I really doubt that. You will pay for that saving many times over by the less regular pattern of 'attention' and the more complex code. It will be really hard to use a GPU to accelerate it as well as is already done with standard CNNs. Besides, even a 16x reduction in total workload, and I actually doubt there would be any reduction in workload at all, is not that significant. What actually matters is the quality of the end result.

[Optimist] We shouldn't be worrying about that GPU. That's 'premature optimisation'. You're artificially constraining your thinking by the hardware we use right now.

[Doubter] Nevertheless, GPU is the hardware we have right now, and we want practical systems. An alternative to CNNs using hybrid CPU/GPU at least has to come close on speed to current CNNs on GPU, and have some other key advantage.

[Optimist] Explainability in a saccadic CNN is better, since you have the explicit weightings for attention. For any output, you can show where the attention is.

[Doubter] But that is not new. We can already show where attention is by looking at what weights mattered in a classification. See for example the way we learned that 'hands' were important in detecting dumbbell weights, or that snow was important in differentiating wolves from dogs.

[Optimist] Right. And those insights into how CNNs classify were really valuable landmarks, weren't they? And now we would have something more direct to do that, as we can go straight to the attention weights. And we can explore better strategies for setting those weights.

[Doubter] You still haven't explained exactly how the attention layers would be constructed, nor have you explained the later 'better strategies' nor how you would progress to temporal attention strategies. I doubt the basic idea would do more than a slightly deeper CNN would. Until I see an actual working example, I'm unconvinced. Can we move on to 'parameter compression'?

[Optimist] Sure.


[Doubter] So what I am struggling with is that you are throwing away data after a little training. Why 'lossy compression' and not 'lossless compression'?

[Optimist] That's part of the point of it. We're trying to reward a low bit count description of the weights.

[Doubter] Hold on a moment. You're talking more like a proponent of evolutionary algorithms than of neural networks. You can't 'back propagate' a reward for a low entropy solution back up the net. All you can do is choose one such parameter set over another.

[Optimist] Exactly. Neural networks are in fact just a particular rather constrained case of evolutionary algorithm. I'd contend there is advantage in exploring new ways of reducing the degrees of freedom in them. CNNs do reduce the degrees of freedom, but not in a very general way. We need to add something like compression of parameters if we want low degrees of freedom with more generality.

[Doubter] In CNNs that lack of generality is an advantage. Your approach could encode a network with a ridiculously large number of useless non-zero weights - whilst still using very few bits. That won't work. That would take way longer to compute one iteration. It would be as slow as pitch drops dripping.

[Optimist] Right. So some attention must be paid to exactly what the lossy compression algorithm is. Just as jpeg throws away low weight vectors, this compression algorithm could too.
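A JPEG-flavoured sketch of that idea, with heavy caveats: it treats one row of weights as a 1-D signal, takes an orthonormal discrete cosine transform, and keeps only the largest-magnitude coefficients. Whether real network weights are compressible in this basis is an open assumption, and `compress_weights` and `keep` are hypothetical names.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    return [
        math.sqrt((1 if k == 0 else 2) / n)
        * sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(x))
        for k in range(n)
    ]

def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(math.sqrt((1 if k == 0 else 2) / n) * c
            * math.cos(math.pi / n * (i + 0.5) * k)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]

def compress_weights(weights, keep):
    """Zero all but the `keep` largest-magnitude DCT coefficients.
    Exact ties at the threshold would keep slightly more than `keep`."""
    coeffs = dct(weights)
    threshold = sorted((abs(c) for c in coeffs), reverse=True)[keep - 1]
    kept = [c if abs(c) >= threshold else 0.0 for c in coeffs]
    return kept, idct(kept)

weights = [0.1 * i for i in range(16)]          # a smooth, compressible row
kept, approx = compress_weights(weights, keep=4)
max_err = max(abs(a - w) for a, w in zip(approx, weights))
print("non-zero coefficients:", sum(1 for c in kept if c != 0.0),
      "max error:", round(max_err, 3))
```

Note that the reconstructed weights are generally all non-zero even though few coefficients are stored, which is exactly the Doubter's worry about compute cost.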

[Doubter] So I have a couple of comments here. You have not worked out the details, right? It also doesn't sound like this is bio-inspired, which was at least a saving grace of the saccadic idea.

[Optimist] Well, the compression idea wasn't bio-inspired originally, but later I got to thinking about how genes could create many 'similar patterns' of connections locally. That could do CNN type connections, but genes can also do similar patterns with long range connections. So for example, genes could learn the ideal density of long range connections relative to short range connections. That connection plan gets repeated in many places whilst being encoded compactly. In that sense genes are a compression code.

[Doubter] So you are mixing genetic algorithms and neural networks? That sounds like a recipe for more parameters.

[Optimist] ...a recipe for new ways of reducing the number of parameters.

[Doubter] I think I see a pattern here, in that both ideas offer CNNs as a special case. With saccadic networks the secret sauce is some not too clear way you would program the 'attention' function. With parameter compression your secret sauce is the choice of lossy compression function. If you 'got funded' to do some demo coding, you could keep naive investors happy for a long while with networks that were actually no better than existing CNNs and plenty of promises of more to come with more funding. But the 'more to come later' never would come. Your deep problem is the 'secret sauce' is more aspiration than actually demonstrable.

[Optimist] I think that's a little unfair. I am not claiming these approaches are implemented, demonstrable improvements. I am not claiming that I know exactly how to get the details of these two ideas right quickly. You are also losing sight of the overall goal, which is to progress the value of AI as a positive transformative force.

[Doubter] Hmm. I see only a not-too-convincing claim of being able to increase the power of machine learning and an attempt to burnish your ego and your reputation. Where is the focus on positive transformative force?

[Optimist] Breaking the mould on how to think about machine learning is a pretty important subgoal in progressing thought on AI, don't you think? "Less Wrong" is the best possible place on the internet for engaging in discussion of ethical progression of AI. If this 'subgoal' post does not gather any useful feedback at all, then I'll have to agree with you that my post is not helping progress the possible positive transformative aspects of AI - and try again with another iteration and different post, until I find what works.


7 comments

I think this could use an opening line or paragraph roughly indicating who the post is for (from the looks of it, people with some background in neural networks, although I couldn't specify it in more detail than that)

Thanks. Paragraph added.

You might be interested in Transformer Networks, which use a learned pattern of attention to route data between layers. They're pretty popular and have been used in some impressive applications like this very convincing image-synthesis GAN.

re: whether this is a good research direction. The fact that neural networks are highly compressible is very interesting and I too suspect that exploiting this fact could lead to more powerful models. However, if your goal is to increase the chance that AI has a positive impact, then it seems like the relevant thing is how quickly our understanding of how to align AI systems progresses, relative to our understanding of how to build powerful AI systems. As described, this idea sounds like it would be more useful for the latter.

The image synthesis is impressive. The Transformer network paper looks intriguing. I will need to read again much more slowly, and not skim to understand it. Thanks for both the links and feedback on aligning AI.

I agree the ideas really are about progressing AI, rather than progressing AI specifically in a positive way. As a post-hoc justification though, exploring attention mechanisms in machine learning indicates that what AI 'cares about' may be pretty deeply embedded in its technology. Your comment, and my need to justify post-hoc, set me to the task of making that link more concrete, so let me expand on that.

I think many animals have almost hard-wired attention mechanisms for alerting them to eyes. Things with eyes are animate, and may need a reaction more rapidly than rocks or trees do. Animals do have almost hard-wired attention mechanisms for sudden movement too.

What alerting or attention setting mechanisms will AIs for self-driving cars have? Probably they will prioritise sudden movement detection. Probably they won't have any specific mechanism for alerting to eyes. Perhaps that's a mistake.

I've noticed that the bounding boxes in some videos of 'what a car sees' are pretty good for following vehicles, but flick on and off for bounding boxes around people on the sidewalk. The stable bounding boxes are relatively square. The unstable bounding boxes are tall and thin.

Now just maybe, we want to make a visual architecture that is very good at distinguishing tall thin objects that could be people, from tall thin objects that could be lamp posts. That has implications all the way down to the visual pipeline. The car is not going to be good at solving trolley problems if it can tell trucks from cars, but can't tell people from lamp posts.

So if I understand you, for (1) you're proposing a "hard" attention over the image, rather than the "soft" differentiable attention which is typically meant by "attention" for NNs.

You might find interesting "Recurrent Models of Visual Attention" by DeepMind. They use a hard attention over the image with RL to train where to attend. I found it interesting -- there's been subsequent work using hard attention as well (I think this is a central paper for the topic, but I could be wrong, and I'm not at all sure what the most interesting recent one is).

That paper is new to me - and yes related and interesting. I like their use of a 'glimpse' = more resolution in centre, less resolution further away.

About 'hard' and 'soft' - if 'hard' and 'soft' mean what I think they do, then yes, the attention is 'hard'. It forces some weights to zero that in a fully connected network could end up nonzero. That might require some care in training, as a network that has attention 'way off' where it should be has no gradient to guide it towards better solutions.
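A tiny numerical illustration of that missing gradient, assuming a single gate on a single feature (a sigmoid for the 'soft' case, a step function for the 'hard' case; the feature and target values are arbitrary):

```python
import math

def soft_gate(score):
    """Differentiable attention: a sigmoid of the attention score."""
    return 1.0 / (1.0 + math.exp(-score))

def hard_gate(score):
    """Hard attention: the feature is either fully used or fully ignored."""
    return 1.0 if score > 0 else 0.0

def loss(gate, score, feature=2.0, target=2.0):
    return (gate(score) * feature - target) ** 2

def numeric_grad(gate, score, eps=1e-4):
    """Central finite-difference estimate of d(loss)/d(score)."""
    return (loss(gate, score + eps) - loss(gate, score - eps)) / (2 * eps)

score = -3.0  # attention is 'way off': the useful feature is gated out
hard_grad = numeric_grad(hard_gate, score)
soft_grad = numeric_grad(soft_gate, score)
print("hard:", hard_grad, "soft:", soft_grad)
```

With the hard gate the loss is flat around the current score, so gradient descent gets no signal; the soft gate gives a negative slope, pointing towards more attention. This is one reason hard attention is often trained with reinforcement learning instead.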

Thanks for the link to the paper and the idea of thinking about to what extent the attention is/is-not differentiable.

There is a large existing literature on pruning neural networks, starting with the 1990 paper "Optimal Brain Damage" by Le Cun, Denker and Solla. A recent paper with more references is