One of the dominant paradigms in image processing today is transfer learning: taking well-established image classification networks (such as DenseNet, VGG, or Inception) and reusing them for other image processing applications. The idea is that in the process of learning to classify images, the networks learn to recognize distinctive visual features at different layers. These start from lines in different directions and build up through colors, curves, concentric circles, and corners, all the way to the classes used in the original application, like faces or dog species.

These lower-level features, stored primarily in the earlier layers of the network, are in some sense “visually fundamental.” In the same way that computers display images by building them out of pixels, we build ideas about images from an encoded breakdown of the image into visual parts, which the network replicates.

For a wide variety of tasks, these features can be incredibly useful, giving good results even with limited data and little training time, since the pretrained visual features are already sufficient to organize the data. However, in some cases transfer learning fails, and even substantial training time and tuning in a number of ways yields only modest results.
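To make the setup concrete, here is a toy sketch of the transfer-learning recipe: a frozen “pretrained backbone” whose weights are never updated, with only a small classification head trained on top. The backbone here is just a fixed random projection standing in for truncated DenseNet/VGG-style layers, and the data is synthetic, so this is an illustration of the training pattern rather than a real pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pretrained backbone: a fixed random projection
# followed by a ReLU. In practice this would be the early layers of a
# network like DenseNet or VGG, with their weights held constant.
W_backbone = rng.normal(size=(64, 8))

def extract_features(x):
    """Frozen 'pretrained' feature extractor (weights never updated)."""
    return np.maximum(x @ W_backbone, 0.0)  # ReLU

# Tiny synthetic binary classification task, linearly separable in
# feature space by construction.
X = rng.normal(size=(200, 64))
true_w = rng.normal(size=8)
y = (extract_features(X) @ true_w > 0).astype(float)

# Only the small linear head is trained -- this is the cheap part that
# lets transfer learning work with limited data and training time.
w, b = np.zeros(8), 0.0
lr = 0.1
for _ in range(500):
    feats = extract_features(X)
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid
    grad = p - y                                # logistic-loss gradient
    w -= lr * feats.T @ grad / len(X)
    b -= lr * grad.mean()

p = 1.0 / (1.0 + np.exp(-(extract_features(X) @ w + b)))
acc = ((p > 0.5) == y).mean()
print(f"head-only training accuracy: {acc:.2f}")
```

Because the backbone already produces features in which the classes are separable, the cheap linear head is all that needs fitting; that separability is exactly the assumption that breaks down in the failure cases discussed below.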

What could be happening in these cases?

Our general theory goes:

low level features (like lines and curves) → mid level features (like corners and gradients) → high level features (like eyes or cups) → classification of images

We can have some confidence in the existence and reliability of chains like these because we can see neurons in the network activating strongly in response to these features, and because we can observe the framework's effectiveness in practice in many cases. So how can it fail to deliver?
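The bottom rung of that chain is easy to make concrete: the “lines in different directions” that early layers learn behave much like hand-designed edge filters. Below is a minimal sketch using a Sobel-style kernel (a classic vertical-edge detector, here as a stand-in for a learned first-layer filter) applied to a synthetic image with one hard edge.

```python
import numpy as np

# A 'low-level feature' of the kind early conv layers learn: a vertical
# edge detector (Sobel kernel). Pretrained networks learn banks of
# filters much like this one in their first layers.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Synthetic 8x8 image: dark left half, bright right half.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

def conv2d(image, kernel):
    """Naive 'valid' sliding-window filter (no kernel flipping, i.e.
    cross-correlation, as deep learning frameworks do it)."""
    h = image.shape[0] - kernel.shape[0] + 1
    w = image.shape[1] - kernel.shape[1] + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out

response = conv2d(img, sobel_x)
# The filter fires only where the edge is, and is silent elsewhere.
print(response[0])  # → [0. 0. 4. 4. 0. 0.]
```

Mid- and high-level features are then (roughly) filters over the outputs of filters like this one, which is why damage at the bottom of the chain propagates upward.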

It could be that the high level features needed to distinguish between the relevant classes aren’t available. For example, if your classifier needs to deduce whether a button next to a buttonhole is fastened or not, the model might be able to detect buttons, but it may have learned to ignore whether the button is occluded, and may lack features for the nearby overlap of fabric or the tension in the outer fabric that would indicate the button is fastened. In this case the low level features are likely still useful, and retraining more layers of the model, or transferring from an earlier layer of the model, could be effective. However, as with any time you train a more complex model, this may require more data, somewhat obviating the usefulness of transfer learning.
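The “transfer from an earlier layer” remedy can be sketched schematically: treat the pretrained network as a stack of layers and cut it off before the high-level features that turned out to be unhelpful. The three-layer “network” below is a hypothetical stand-in built from random projections; with a real framework you would truncate an actual pretrained model instead.

```python
import numpy as np

rng = np.random.default_rng(1)

# Schematic stand-in for a pretrained network: a stack of frozen layers.
# In a real framework you would truncate an actual pretrained model.
layers = [
    lambda x, W=rng.normal(size=(32, 32)) / 6: np.maximum(x @ W, 0),  # low-level
    lambda x, W=rng.normal(size=(32, 16)) / 6: np.maximum(x @ W, 0),  # mid-level
    lambda x, W=rng.normal(size=(16, 8)) / 4: np.maximum(x @ W, 0),   # high-level
]

def features_at(x, cut):
    """Run the frozen network only up to layer `cut` and return those
    activations -- i.e. transfer from an earlier, lower-level layer."""
    for layer in layers[:cut]:
        x = layer(x)
    return x

x = rng.normal(size=(5, 32))
early = features_at(x, 1)  # lower-level, higher-dimensional features
late = features_at(x, 3)   # higher-level, more task-specific features
print(early.shape, late.shape)  # → (5, 32) (5, 8)
```

Note the trade-off visible even in the shapes: cutting earlier gives you more generic, higher-dimensional features, so the new head you train on top is bigger and hungrier for data.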

It could also be that the mid level features are poorly calibrated. I’m a bit uncertain as to what this means, in part because what constitutes a “mid-level feature” has not been rigorously defined. But one could imagine this happening when the distribution of images is VERY different from the one the network was originally trained on. For example, if your new task involves matching adjacent pieces of difficult jigsaw puzzles, you may require very precise pixel-to-pixel comparisons that aren’t as important in a classification context, where there is usually only one relevant object in the image. In this case, again, it might be possible to retrain more of the model, but this would often require more data than is available in a situation where transfer learning is the best option. While this is likely the issue I’m facing in the project I’m currently working on, I have absolutely no idea how to address it and would love to hear any feedback—especially of the form “this is the technical term people usually use when asking about this problem.”
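One crude way to check for this kind of miscalibration, under the assumption that you can run both datasets through the network, is to compare a layer's activation statistics on data resembling the original training distribution against its statistics on your new images. The activations below are simulated rather than taken from a real model, so this is only a sketch of the diagnostic, not a validated method.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend these are one layer's activations on images like the original
# training data vs. on the new task's images (simulated here).
acts_original = rng.normal(loc=0.0, scale=1.0, size=(1000, 16))
acts_new = rng.normal(loc=1.5, scale=2.0, size=(1000, 16))

# Per-feature standardized mean shift: a crude drift score measuring how
# far each feature's mean has moved, in units of its original spread.
mu = acts_original.mean(axis=0)
sigma = acts_original.std(axis=0)
shift = np.abs(acts_new.mean(axis=0) - mu) / sigma

drifted = int(np.sum(shift > 0.5))
print(f"{drifted}/16 features show a large distribution shift")
```

If many features drift like this at some layer, that layer is a reasonable candidate for where to start unfreezing, or for where to cut the network when transferring from an earlier layer.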

Finally, based on this stupidly simple model I’ve proposed, it could be that the low level features, such as detection of edges in the image, aren’t properly calibrated for your dataset. I would expect this to happen when the type of image being considered has shifted drastically—for example, moving from classifying real life dog species to cartoon dogs. In this case, the pixel-to-pixel nature of the images is completely different, and transfer learning may not be able to help you.

Or maybe it could! In the example I mentioned, the styles of the images are different, and changing image styles via transfer learning is actually well established. If you impose a photorealistic style on your cartoon dogs, you will probably first end up with some pretty horrific visuals, which is always a plus. Then you’ll be back in a position where your network can identify basic features, and with any luck the new visual style will still contain comparable elements in similar relative positions, which can be used to identify classes reasonably well. Unfortunately, I would imagine you might encounter trouble with things like the simplification of elements in cartoon images and end up in a situation where your distributions of low level features are very different and your mid level features no longer work well—the least hopeful situation, at this point, from my perspective. However, unlike the situation where this was your problem to begin with, you’ve at least created some horrific modern art along the way.
