How to pick your categories

[-]xamdam15y240

The descriptive math part was very good, thanks - and that's why I resisted downvoting the post. My problem is that the conclusion omits the hugely important factor that categories are useful for specific goals, and the kind of techniques you are suggesting (essentially unsupervised techniques) are context-free.

E.g. is a dead cow more similar to a dead (fixed from 'live') horse or to a live cow? (It clearly depends what you want to do with it)

[-][anonymous]15y60

That's a good point.

I tend to find techniques attractive when they're generalizable, and context-free, unsupervised techniques fit the bill. You can automate them. You can apply them to a range of projects. But you're right -- sometimes specific knowledge about a specific application matters, and you can't generalize it away.

[-]sketerpot15y40

The two aren't mutually exclusive, of course. You can use specific knowledge about a particular problem to make your machine learning methods work better, sometimes.

I read a paper the other day about predicting the targets of a particular type of small nucleolar RNA, which are an important part of the machinery that regulates gene expression. One of the methods they used was to run an SVM classifier on a number of features about the RNA in question. SVM classifiers are one of those nice general-purpose easily-automated methods, but the authors used their knowledge of the specific problem to pick out what features it would use for its classification. Things like the length of particular parts of the RNA -- stuff that would occur to molecular biologists, but could be prohibitively expensive for a purely automatic machine learning algorithm to discover if you just gave it all the relevant data.

(More bio nerdery: they combined this with a fast approximation of the electrostatic forces at work, and ended up getting remarkably good accuracy and speed. The paper is here, if anyone's interested.)

[-][anonymous]15y80

Belatedly, I remembered a relevant tidbit of wisdom I once got from a math professor.

When a theorist comes up with a new algorithm, it's not going to outperform existing algorithms used commercially in the "real world." Not even if, in principle the new algorithm is more elegant or faster or whatever. Why? Because in the real world, you don't just take a general-purpose algorithm off the page, you optimize the hell out of it. Engineers who work with airplanes will jimmy their algorithms to accommodate all the practical "common knowledge" about airplanes. A mere mathematician who doesn't know anything about airplanes can't compete with that.

If you're a theorist trying to come up with a better general method your goal is to give evidence that your algorithm will do better than the existing one after you optimize the hell out of them equally.

[-]kpreid15y20

Your example doesn't quite make sense to me. Did you mean “is a dead cow more similar to a dead horse or to a live cow”, or ...?

[-]thomblake15y200

When you play Twenty Questions with the universe, some questions are more useful than others.

Very quotable

[-]sketerpot15y30

And it brings to mind decision trees, which are essentially an automated way of playing Twenty Questions with the universe. In order to avoid over-fitting your training data, once you've constructed a complete decision tree, you go back and prune it, removing questions that are below a certain threshold of usefulness.

The usual way you do this is, you look at the expected reduction in entropy from asking a particular question. If it doesn't reduce the entropy much, don't bother asking. If you know that an animal is a bird, you don't gain much by asking "Is it an Emperor penguin?". You would reduce the entropy in your pool of possible birds more by asking if it's a songbird, or if its average adult wingspan is more than 10 cm.

SarahC's quote is not only clever, but also supported by solid math and practical application.

[-]AdShea15y130

Overall you did a great job explaining the mathematics of unsupervised categorization but you missed one point in your end-matter.

The initial Monera classification was not a bad category at the time because when it was created there wasn't enough data to split out different subcategories. All the researchers had were a bunch of fuzzy spots wiggling under a microscope. You touched on this in the Amazon.com example. Just because the categories you have now are good for your current data doesn't mean that they will remain the same with further data.

[-][anonymous]15y50

Fair enough.

[-]HughRistik15y50

For example, I once saw a study of autism that did the following: created a questionnaire that rated the user's "empathizing" and "systematizing" qualities, found that autistics were less "empathizing" and more "systematizing" than non-autistics, and concluded that autism was defined by more systematizing and less empathizing.

This was Simon Baron-Cohen's EQ-SQ research. I don't remember exactly what they concluded.

[-][anonymous]15y70

yep, that's it. I'm a little nervous about dissing a famous researcher, but I did read the paper and it didn't seem right. It was definitely phrased as if the correlation between autism and the results of various questionnaires vindicated modeling autism as a empathizing-systematizing spectrum. I'm not saying I disagree with that model personally (how would I know?) but that it's not good enough justification to do it that way.

Of course, to be fair to Baron-Cohen, I've read one paper of his, not his entire body of work; if he fills in the gap elsewhere, more power to him. In that case, my example of "bad research" would be fictional (but still, I believe, bad.)

[-]JoshuaZ15y50

Another related problem that might be worth talking about is overfitting. Deciding what is noise and what isn't noise and the related problem of how finely grained categories should be can be tough.

[-][anonymous]15y40

Yes! Whole new issue! Multi-scale methods can help (I like this example a lot) but I think there's still a little ad-hoc stuff hidden within the algorithm.

[-]InquilineKea15y30

Very interesting points. I'm still trying to learn the math behind categorization myself.

Regarding the autism categorizations - good points. It's also quite possible that autistics might score lower on the systematizing quotient than non-autistics in a different country/world. How could that happen? The questions on that systematizing quotient test were highly subject specific - some of the questions had to do with furniture, others had to do with time tables, others had to do with statistics. The type of person who scores high on systematizing would have to have broad interests. An autistic with exceptionally narrow interests would score very low on this (even though his interests could still have especially high intensity - the intensity and the narrowness both owing themselves to autism).

But it's quite possible that an autistic person could have obsessions with entirely different domains that don't appear on the systematizing quotient test - domains that were more salient in a different culture/world.

In another example, I'll bring up this hypothesis: http://en.wikipedia.org/wiki/Differential_susceptibility_hypothesis http://www.theatlantic.com/magazine/archive/2009/12/the-science-of-success/7761/

So it's entirely possible that someone with a particular genotype could exhibit one phenotype in one environment, and the exact opposite phenotype in the second environment. How would they then be classified? As according to their genotype? Well, maybe. But in America, the total scope of environmental variation is highly restricted (almost no one suffers from extreme starvation). Environmental variation could be significantly increased through extreme environmental circumstances, or even by cyborg technology. After we use this - how can we then classify people?

Here's a post I once wrote on classification: http://forums.philosophyforums.com/threads/classification-theory-29482.html

One of my major points: Even the "Tree of Life" isn't strictly a "tree of life". Humans owe 8% of their DNA to some rhinovirus (IIRC). It's entirely possible that in a world of increased viral activity, that the "tree" would totally break down (in fact, there probably is no "tree" in the bacterial kingdoms).

And of course, then if we implement cyborg technology (or artificial DNA) into the bacteria - it makes classification even more complicated. We could compare differences in letter groups in DNA. But what if the artificial genome had different molecules that made up a helix?

[-]Psy-Kosh15y20

Possibly stupid thought, but rather than just applying these techniques to the whole of thingspace, shouldn't one first do something like this: get a bunch of incoming data, create density map (by measuring at various resolutions, perhaps), look for regions with density significantly different from others (ie, first locate some clusters) and then apply these techniques in those regions rather than over the entire whole of the space to help learn the properties of those clusters?

[-]taw15y20

Evidently, they have a big adjacency matrix somewhere, one column for every customer and one row for every purchase, and quite possibly they're running some sort of a graph diffusion on it.

My information might be out of date, but last time I checked they had one such matrix per department, supposedly with some timing information etc. So they won't recommend Bolt plush toy if you liked Bolt DVD.

I'll leave the moral of this story to you.

[-]Jordan15y20

Awesome, thanks for posting this.

I've been working with SVDs a lot in my research recently, and have gotten interested in manifold learning as a result.

What are your data sets like and how well has your manifold learning algorithms worked on them? I get the feeling that, for instance, Local Linear Embedding wouldn't do so well factoring out rotation in computer vision.

[-]Manax15y20

I spent some time learning about this, when I was dabbling with the Netflix contest. There was a fair bit of discussion on their forum regarding SVD & related algorithms. The winner used a "blended" approach, which weighted results from numerous different algorithms. The success of the algorithm was based on the RMSE of the predictions compared with a test set of data.

[-]imonroe15y20

While the math is a little outside my current capabilities, I really appreciate this thread, because I've been working on the very beginning stages of a project that requires computational categorization algorithms, and you've given me a lot of good information, and perhaps more importantly, some new things to go and study.

Thanks!

[-][anonymous]15y-10

How does one obtain a high-dimensional featurespace to begin with? Can one bootstrap from a one-dimensional space?

I can't think of any satisfactory way to do this right now.

[-][anonymous]15y120

You shouldn't want to have a high-dimensional space. High-dimensional spaces are hard to work with, it's just that they come up often. You basically obtain one when you look at an object or concept or what have you, then think of everything you could measure about it and measure that.

[-]sketerpot15y40

Misha's answer is almost always the right one, but you technically can project points into a higher-dimensional space using a kernel function. This comes up in Support Vector Machines where you're trying to separate two classes of data points by drawing a hyperplane between them. If your data isn't linearly separable, projecting it into a higher-dimensional space can sometimes help.

But most of the time, what you want to so is just measure everything you can think of, and let those measurements be your dimensions. When looking at rubes and bleggs, measure things like redness, blueness, roundedness, furryness, whatever you can think of. Each of those is one dimension. Before you know it, you've got a high-dimensional featurespace. Good luck dealing with it.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

78

How to pick your categories

78

78