Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

With thanks to Lee Sharkey and Michael Cohen for the conversations that led to these ideas.

In a previous post, I talked about how we could train classifiers on the same classification problem - a set of lions vs a set of huskies - but using different approaches to classify.

What we want is something we can informally call a 'basis' - a collection of classifiers that are as independent of each other as possible, but that you can combine to generate any way of dividing those two image sets. For example, we might have a colour classifier (white vs yellow-brown), a terrain classifier (snow vs dirt), a background plant classifier, various classifiers on the animals themselves, and so on. Then, if we've done our job well, when we find any not-too-complex classifier $C$, we can say that it's something like 'colour, plus nose shape, minus plant[1]'.

We shouldn't put too much weight on that analogy, but we do want our classifiers to be independent, each classifier distinct from anything you can construct with all the others.

Here are four ways we might achieve this.

Randomised initial seeds

An easy way of getting an ensemble of classifiers is to have a bunch of neural nets (or other classification methods), initialise them with different initial weights, and train them on the same sets. And/or we could train them on different subsets of the lion and husky sets.
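As a minimal sketch of this in PyTorch (here `make_classifier` and `dataset` are hypothetical stand-ins for whatever model constructor and labelled husky/lion data you're using):

```python
import torch
from torch.utils.data import DataLoader, Subset

def train_ensemble(make_classifier, dataset, n_models=5, epochs=10, subset_frac=0.8):
    """Train several classifiers that differ only in their random seed
    and in the random subset of the labelled data they see."""
    models = []
    for seed in range(n_models):
        torch.manual_seed(seed)  # different initial weights per model
        g = torch.Generator().manual_seed(seed)
        n_sub = int(subset_frac * len(dataset))
        idx = torch.randperm(len(dataset), generator=g)[:n_sub]
        loader = DataLoader(Subset(dataset, idx.tolist()), batch_size=32, shuffle=True)

        model = make_classifier()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                torch.nn.functional.cross_entropy(model(x), y).backward()
                opt.step()
        models.append(model)
    return models
```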

The advantage of this method is that it's simple and easy to do - as long as we can train one classifier, we can train them all. The disadvantage is that we're relying on luck and local minima to do the job for us. In practice, I expect these methods to all converge to "white vs yellow-brown" or similar. Even if there are local minima in the classification, there's no guarantee that we'll find them all, or even any. And there's no guarantee that the local minima are very independent - colour and nose shape might be a local minimum, but it's barely different from a colour classifier.

So theoretically, this isn't sound; in practice, it's easy to implement and play around with, so might lead to interesting insights.

Distinct internal structure

Another approach would be to insist that the classifiers' internal structures are distinct. For example, we could train two neural net classifiers $C_1$ and $C_2$, with weights $W_1$ and $W_2$ respectively. They could be trained to minimise their individual classification losses and regularisations, while ensuring that $W_1$ and $W_2$ are distinct; so a term like $-\|W_1 - W_2\|^2$ would be added to the loss function.
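A rough sketch of such a loss in PyTorch; the squared-distance penalty and the coefficient `lam` are illustrative choices rather than anything specified above:

```python
import torch.nn.functional as F

def weight_distance(model_1, model_2):
    """Squared L2 distance between the two weight vectors W1 and W2
    (assumes the models share an architecture)."""
    return sum(((p1 - p2) ** 2).sum()
               for p1, p2 in zip(model_1.parameters(), model_2.parameters()))

def joint_loss(model_1, model_2, x, y, lam=1e-3):
    """Both classification losses, minus a term that rewards distinct weights."""
    task_loss = F.cross_entropy(model_1(x), y) + F.cross_entropy(model_2(x), y)
    return task_loss - lam * weight_distance(model_1, model_2)
```

In practice the distance term would presumably need to be bounded or normalised, since as written it can be driven up without limit.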

This approach has the advantage of forcing the classifiers to explore a larger space, and is not restricted to finding local minima. But it's still theoretically unsatisfactory, and there's no guarantee that the classifiers will really be distinct: $C_1$ and $C_2$ may still end up as colour classifiers, classifying the same colour in two very different ways.

Distinct relative to another set

In the previous methods, we have defined independence relative to the classifiers themselves, not to their results. But imagine now that we had another unlabelled set of images $U$, consisting of, say, lots of varied animal images.

We can now get a theoretical definition of independence: $C_1$ and $C_2$ are independent if they give similar results on the lion-vs-husky problem, but are distinct on $U$.

We might imagine measuring this difference directly on $U$: then knowing the classification that $C_1$ gives on any element of $U$ tells us nothing about what $C_2$ would give. Or we could use $U$ in a more semi-supervised way: from these images, we might extract features and concepts like background, fur, animal, tree, sky, etc. Then we could require that $C_1$ and $C_2$ classify huskies and lions using only those features; independence being enforced by the requirement that they use different features, as uncorrelated as possible.
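Here is a sketch of the first, direct version in PyTorch: both classifiers fit the labelled husky/lion data, while a decorrelation penalty on a batch from $U$ pushes their judgements apart there. The choice of penalty and the coefficient `lam` are illustrative assumptions.

```python
import torch.nn.functional as F

def output_correlation(model_1, model_2, u_batch):
    """Pearson correlation between the two classifiers' 'husky' logits
    on a batch of unlabelled images from U."""
    s1 = model_1(u_batch)[:, 0]
    s2 = model_2(u_batch)[:, 0]
    s1 = (s1 - s1.mean()) / (s1.std() + 1e-8)
    s2 = (s2 - s2.mean()) / (s2.std() + 1e-8)
    return (s1 * s2).mean()

def independence_loss(model_1, model_2, x, y, u_batch, lam=1.0):
    """Fit the labelled data, while penalising correlated behaviour on U."""
    task_loss = F.cross_entropy(model_1(x), y) + F.cross_entropy(model_2(x), y)
    return task_loss + lam * output_correlation(model_1, model_2, u_batch).abs()
```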

This seems a promising area of research.

Distinct in some idealised sense

What if $U$ was the set of all conceivable images? Then, if we applied the previous method, we'd get a "maximal" collection of classifiers, spanning all the possible ways that husky-vs-lion classifiers could be different.

I won't add anything to this section currently, as the idea is clearly intractable as stated, and there's no certainty that there is a tractable version. Still, worth keeping in mind as we develop the other methods.


  1. The 'minus' meaning that it actually internally classifies the plants the wrong way round, but still separates the sets correctly, because of the strength of its colour and nose shape classifications. ↩︎

Comments

Just wanted to flag Lakshminarayanan et al. as a standard example of the "train ensemble with different initializations" approach.

A fifth method: you could use unsupervised learning to learn some multidimensional representations for the images, and then understand the classifier's diversity in terms of the diversity of these multidimensional representations.

Interesting; would you need an unlabelled dataset to do this, or would the lion and husky sets be sufficient?

I think it depends on your method. Most current methods would probably heavily benefit from a diverse dataset, as they tend to be based on somehow "compressing" or "clustering" the images. However, it seems like they should still work to an extent on the monotonous husky/lion datasets, just not as much.

However, if one is willing to go beyond current methods, then I feel like it should be possible to make unsupervised representation learning methods that are better able to deal with monotonous datasets. I'm sort of playing with some ideas for this in my spare time, because it seems like a promising approach for me, though I haven't developed them much yet.
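One very rough sketch of that kind of approach, using PCA as a stand-in for the unsupervised representation (all the function names and the cosine-similarity summary here are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

def diversity_via_representation(images, score_fn_1, score_fn_2, n_components=10):
    """Fit an unsupervised representation of the images, then compare which
    representation dimensions each classifier's score correlates with.
    `images`: (n, d) array of flattened images;
    `score_fn_1`, `score_fn_2`: map the image array to per-image scores."""
    z = PCA(n_components=n_components).fit_transform(images)
    s1, s2 = score_fn_1(images), score_fn_2(images)
    corr_1 = np.array([np.corrcoef(z[:, k], s1)[0, 1] for k in range(n_components)])
    corr_2 = np.array([np.corrcoef(z[:, k], s2)[0, 1] for k in range(n_components)])
    # High similarity between the correlation profiles suggests the two
    # classifiers are leaning on similar directions in representation space.
    return float(corr_1 @ corr_2 /
                 (np.linalg.norm(corr_1) * np.linalg.norm(corr_2) + 1e-8))
```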

It seems like there must be some decent ways to see how different two classifiers are, but I can only think of unprincipled things.

Two ideas:

Sample a lot of items and use both models to generate two rankings of the items (or log odds or some other score). Models that give similar scores to lots of examples are probably pretty similar. One problem with this is that optimizing for it when the problem is too easy will train your model to solve the problem in a random way and then invert the ordering within the classes. (A similar solution with a similar problem is judging model similarity by how similarly they respond to deleting parts of the image.)

Maybe you could split the models into two parts, which we might hope were a "feature extractor" part and a "simple classifier" part. (Potentially a reconstruction loss could be added at the split to try to encourage the features to stay feature-y, but maybe it's not too important.) Then you measure how different two models are by training a third classifier that's given access to the features from both models, and seeing by how much it outperforms the originals.
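A minimal sketch of the first idea, using rank correlation of the two models' scores over a sample of items as the similarity measure (the helper name is mine):

```python
import numpy as np
from scipy.stats import spearmanr

def score_similarity(scores_1, scores_2):
    """Rank correlation between two models' per-item scores (e.g. log odds).
    Values near +1 or -1 mean the models order the items (almost)
    identically, up to the inversion problem noted above."""
    rho, _ = spearmanr(scores_1, scores_2)
    return rho

# Example: scores from two hypothetical models on the same sampled items.
# similarity = score_similarity(model_1_scores, model_2_scores)
```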