*Note: the most up-to-date version of this proposal can be found here.*

## Introduction

If I present you with five examples of burritos, I don’t want you to pursue thesimplestway of classifying burritos versus non-burritos. I want you to come up with a way of classifying the five burritos and none of the non-burritos thatcovers as little area as possible in the positive examples, while still having enough space around the positive examples that the AI can make a new burrito that’s not molecularly identical to the previous ones.

-AI Alignment: Why It's Hard and Where to Start

Consider the problem of designing classifiers that are able to reliably say "I don't know" for inputs well outside of their training set. This problem is studied as *open-category classification* in the literature [6]; a closely-related problem in AI safety is *inductive ambiguity detection* [4].

Solution of generalized open-category classification could allow for agents who robustly know when to ask for clarification; in this post, we discuss the problem in the context of classification. Although narrower in scope, the solution of the classification subproblem would herald the arrival of robust classifiers which extrapolate conservatively and intuitively from their training data.

It's obvious that we can't just teach classifiers the class. One, this isn't compact, and two, it doesn't make sense - it's a map/territory confusion ( is a feature of our current world model, not a meaningful ontological class). Let's decompose the concept a bit further.

Weight regularization helps us find a mathematically-simple volume in the input space which encapsulates the data we have seen so far. In fact, a binary classifier enforces a hyperplane which cleaves the *entire* space into two volumes in a way which happens to nearly-optimally classify the training data. This hyperplane may be relatively simple to describe mathematically, but the problem is that the two volumes created are far too expansive.

In short, we'd like our machine learning algorithms to learn explanations which fit the facts seen to date, are simple, and don't generalize too far afield.

## Prior Work

Taylor et al. provide an overview of relevant research [4]. Taylor also introduced an approach for rejecting examples from distributions too far from the training distribution [5].

Yu et al. employed adversarial sample generation to train classifiers to distinguish seen from unseen data, achieving moderate success [6]. Li and Li trained a classifier to sniff out adversarial examples by detecting whether the input is from a different distribution [3]. Once detected, the example had a local average filter applied to it to recover the non-adversarial original.

The Knows What It Knows framework allows a learner to avoid making mistakes by deferring judgment to a human some polynomial number of times [2]. However, this framework makes the unrealistically-strong assumption that the data are *i.i.d.*; furthermore, efficient KWIK algorithms are not known for complex hypothesis classes (such as those explored in neural network-based approaches).

## Penalizing Volume

If we want a robust / classifier, we should indeed cleave the space in two, but with the vast majority of the space being allocated to . In other words, we're searching for the smallest, simplest volume which encapsulates the training data.

The most *compact* encapsulation of the training data is a strange, contorted volume only encapsulating the training set. However, we still penalize model complexity, so that shouldn't happen. As new examples are added to the training set, the classifier would have to find a way to expand or contract its class volumes appropriately. This is conceptually similar to version space learning.

This may be advantageous compared to current approaches in that we aren't training an inherently-incorrect classifier to prune itself or to abstain during uncertain situations. Instead, we structure the optimization pressure in such a way that conservative generalization is the norm.

### Formalization

Let be a function returning the proportion of input space volume occupied by non-unknown classes: (set underflow aside for the moment). We need some function which translates this to a reasonable penalty (as the proportion alone would be a rounding error in the overall loss function). For now, assume this function is .

We observe a datum whose ground truth label is . Given loss function , weights , classifier , complexity penalty function , and , the total loss is

Depending on the formulation of , may elect to misclassify a small amount of data to avoid generalizing too far. If more similar examples are added, exerts an increasing amount of pressure on , eventually forcing expansion of the class boundary to the data points in question. This optimization pressure seems desirable.

We then try to minimize expected total loss over the true distribution :

### Tractability

Sufficiently sampling the input space is intractable - the space of RGB images is of size . How do we measure or approximate the volume taken up by classes?

- Random image generation wouldn't be very informative, beyond answering "is this class extending indefinitely along arbitrary dimensions?".
- As demonstrated by Yu et al., adversarial approaches may be able to approximate the hypersurface of the class boundaries [6].
- Bayesian optimization could efficiently deduce the hypersurface of the class boundaries, giving a reasonable estimate of volume.
- The continuous latent space of a variational autoencoder [1] could afford new approximation approaches for . However, this would require assuming that the learned latent space adequately represents both seen and unseen data, which may be exactly the
*i.i.d.*assumption we're trying to escape.

## Future Directions

Extremely small input spaces allow for enumerative calculation of volume. By pitting a volume-penalizing classifier against its vanilla counterpart on such a space, one could quickly gauge the promise of this idea.

If the experimental results support the hypothesis, then the obvious next step is formulating tractable approximations (ideally with provably-bounded error).

## Conclusion

This proposal may be an unbounded robust solution to the open-category classification problem.

One of whom was my excellent deep learning professor.

This approach is inspired by the AI safety mindset in that the classifier strives to be conservative in extrapolating from known data.

## Bibliography

[1] D. P. Kingma and M. Welling. Auto-encoding variational bayes. 2016.

[2] L. Li, M.L. Littman, and T.J. Walsh. Knows what it knows: a framework for self-aware learning. 2008.

[3] X. Li and F. Li. Adversarial examples detection in deep networks with convolutional filter statistics. 2016.

[4] J. Taylor et al. Alignment for advanced machine learning systems. 2016.

[5] J. Taylor. Conservative classifiers. 2016.

[6] Y. Yu et al. Open-category classification by adversarial sample generation. 2017.

Good post.

Hm, this isn't obvious to me. Many training sets do make use of an "unknown"/"none of the above" label, right? It's not an

elegantsolution, but it is a solution.It wouldn't surprise me if including "none of the above" examples in your training set often causes learning algorithms to "do the right thing" and draw boundaries between parts of input space that can/cannot be labeled more precisely. If we're penalizing model complexity, a model which does this is probably going to be simpler than a model which memorizes every "none of the above" training example. In any case, it would be easy to include some "none of the above" examples in your validation set that are much different than the "none of the above" examples in your training set and see if they're classified correctly.

I found this a little bit confusing. It sounds like you've answered your own question. If you have a formula for thingspace volume, you can incorporate it in to your loss function, and then minimizing your loss function will tend to compress thingspace volume as a side effect. Where's the difficulty?

I happened to finish a mobile comment and then refreshed after you posted.

The problem is that this isn't robust in practice, as far as I know. It will bound the size somewhat, but only in the dimensions we happened to specify. The space is then cleaved into three: dog, cat, unknown, with all of them extending (basically) indefinitely. There isn't any pressure for 'unknown' to become the default - we just know that 'dog' is bounded by certain images. Better than nothing, but neither robust nor elegant (which makes me skeptical of its ability to robustly scale).

I think you're right, and I may have come to this conclusion too quickly. I wrote out the equations, and it would in fact be able to optimize off of this. Further thought - I imagine the best way to represent it in the loss equation would be as some function of the proportion of thingspace occupied.

Suppose we're making use of an n-dimensional feature space for classification, and the desired representation for "dog" is an n-dimensional hypersphere centered at some coordinates in this n-dimensional space. Suppose we have a procedure for sampling uniformly at random from a finite region

rsurrounding the dog hypersphere. And suppose that we have a procedure for determining whether a sample from this regionris a dog or not (example: ask someone on Mechanical Turk). If so, it may be possible to offer statistical guarantees about the fidelity of our dog representation based on the number of points sampled, the probability of MTurkers misclassifying images, etc.More informally: If you have enough samples labeled "none of the above", and they have a sufficiently broad distribution, then perhaps you can be fairly sure there are "none of the above" examples which bound your notion of "dog" on any given dimension. This doesn't really have a story for adversarially chosen examples though.

It's true that 256 * 256 RGB images are quite a large feature space, but deep learning algorithms typically transform images in to a much smaller feature space before doing classification. So for this random sampling idea to work, you might need a way to reverse engineer images based on randomly chosen coordinates in the smaller feature space. Then there's the problem of ensuring that the transformation into the smaller feature space consistently behaves as expected. Maybe randomly sampling quite close to the "dog" centroid frequently produces non-dogs when reverse engineered.

I do think your idea of treating "none of the above" as a special class, and regularizing so as to minimize the size of every volume but the "none of the above" volume, is a very interesting one.

I guess a simpler, but probably less effective, change would be to tweak your loss function so as to penalize misclassifying a "none of the above" image as a "dog" more heavily than the reverse. Of course, you could also just have a very high decision threshold for actually treating an image as a dog, but tweaking the loss function might have advantages?

I think that would be a good approach, and more immediately actionable than mine. The hard part is sampling uniformly at random from r, as that implies having already found the desired hypersphere. Also, it seems less resistant to adversarial examples.

Wow, I hadn't thought of using latent spaces for this! If we could have a probabilistic guarantee that our latent space is volumetrically-representative of the space it encodes, and if we had a way of accurately classifying the latent space itself (this seems to follow from the definition of how latent spaces are constructed), then we could randomly sample the continuous latent space in order to get a distribution over volumes! The problem is how do you accurately sample an infinite space, but you could probably get around that by bounding the coordinates to some multiple of the farthest-value-seen-thus-far.

I imagine that a latent space would let us do other cool things, like locate the edges of each class with some confidence (if they exist, and not unlike gradient descent). Therefore, we would have multiple ways of approximating volume. However, I don't think I'm familiar enough with them yet to speak confidently and technically about the subject (my class is just reaching autoencoders now).

This seems similar to the r-sampling idea, but in a way which converges more quickly. I still think the issue is guaranteeing robustness, and finding that ideal r to begin with.

Meta question: how can I do strikethrough in my posts? Tildes don't do the trick.

Well in reality, if you are paying people on Mechanical Turk to classify your images, maybe you don't want to sample randomly anyhow. Instead you could select maximally informative data points to ask them about.

This potentially helps with the problem of discovering the bounding region. Suppose that one of the features in the transformed space corresponds to shagginess. And suppose that the shaggiest image in our training set is an image of a dog. A naive learning algorithm might conclude that an image full of shag must be a dog. To deal with this problem, we set shagginess to 10, generate an image, and send it to MTurk. If they think it's a dog, we double our shagginess. If they think it's not a dog, we halve our shagginess. (For this use case, it might be best to ask them to describe the image in a single word... if they're choosing between dog/cat/other, they might select dog on the basis that it looks kinda like dog hair or something like that.) Eventually we get some idea of where the classification boundary should be through binary search.

I'll bet you could do some math to determine how to get the strongest statistical guarantees with the minimum amount of money spent on MTurk too.

Yep. If the dog is represented using a convex polytope instead of a sphere, you might even reverse engineer the corners of your current classifier region, and then display them all to the user to show how expansive the classifier's notion of "dog" is. But the map is not the territory: It's possible that in some cases, the shape the user wants is actually concave.

I'm a deep learning noob too. I'm just about finished with Andrew Ng's Coursera specialization, which was great, but the word "autoencoder" was never used. However there was some discussion of making use of transformed ("latent"? Staying on the safe side because I'm not familiar with that term) feature spaces. Apparently this is how face recognition systems recognize your face given only a single reference image: Map the reference image into a carefully constructed feature space, then map a new image of you in to the same feature space and compute the Euclidean distance. If the distance is small enough, it's a match.

In this case, information is measured by how much of thingspace would be sheared if it turned out that a data point should be classified as 'unknown'. It isn't immediately clear how to find this without a tractable thingspace-volume-subroutine, but I think this would be computationally-efficient for both of our ideas.

The technique you're probably looking for is called Bayesian Optimization. Aside: at my school, 'Optimization' - not 'Conspiracy' - is unfortunately the word which most frequently follows 'Bayesian'.

Even an imperfect estimate of the volume would be useful: for example, perhaps we only find some of the edges and conclude the volume is some fraction of its true value. I have the distinct sense of talking past the point you were trying to make, though.

No, that sounds more or less right.

I've substantially revised this post and would appreciate any feedback!