Good post. Some feedback:
I think you can replace the first instance of "are statistically independent" with "are statistically independent and identically distributed" & improve clarity.
IMO, your argument needs work if you want it to be more than an intuition pump. If the question is the existence or nonexistence of particular clusters, you are essentially assuming what you need to prove in this post. Plus, the existence or nonexistence of clusters is a "choice of ontology" question which doesn't necessarily have a single correct answer.
You're also fuzzing things by talking about discrete distributions here, then linking to Eliezer's discussion of continuous latent variables ("intelligence") without noting the difference. And: If a number of characteristics have been observed to co-vary, this isn't sufficient evidence for any particular causal mechanism. Correlation isn't causation. As I pointed out in this essay, it's possible there's some latent factor like the ease of obtaining calories in an organism's environment which explains interspecies intelligence differences but doesn't say anything about the "intelligence" of software.
replace the first instance of "are statistically independent" with "are statistically independent and identically distributed"
Done, thanks!
talking about discrete distributions here, then linking to Eliezer's discussion of continuous latent variables ("intelligence") without noting the difference
The difference doesn't seem relevant to the narrow point I'm trying to make? I was originally going to use multivariate normal distributions with different means, but then decided to just make up "peaked" discrete distributions in order to keep the arithmetic simple.
I agree with your other two points (mostly - I don't feel that the distinction between discrete and continuous variables is important for Zack's argument so it seems fine to gloss over it) but I disagree with the first.
In order to be able to simply multiply likelihood ratios, the sufficient fact is that they're statistically independent. In this toy model, they also happen to be identically distributed, but I think it's clear from context that Zack would like to apply his argument to a variety of situations where the different dimensions have different distributions. You're suggesting replacing "X, therefore Z" with "X and Y, therefore Z", when in fact X->Z, and it is not the case that Y->Z.
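To make that concrete, here is a minimal sketch in which the dimensions are independent but not identically distributed, and multiplying per-dimension likelihood ratios still works. The marginals and the names `MARGINALS_A`, `MARGINALS_B`, and `joint_likelihood_ratio` are made up for illustration; they are not from the post.

```python
from fractions import Fraction

# Hypothetical marginals: each dimension has its own value set and its own
# pair of distributions, so the dimensions are NOT identically distributed.
MARGINALS_A = [
    {"short": Fraction(3, 4), "long": Fraction(1, 4)},
    {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)},
]
MARGINALS_B = [
    {"short": Fraction(1, 4), "long": Fraction(3, 4)},
    {1: Fraction(1, 6), 2: Fraction(1, 3), 3: Fraction(1, 2)},
]

def joint_likelihood_ratio(point):
    """Likelihood ratio A:B for a point, assuming independent dimensions."""
    ratio = Fraction(1)
    for value, p_a, p_b in zip(point, MARGINALS_A, MARGINALS_B):
        # Independence is what licenses multiplying the per-dimension ratios.
        ratio *= p_a[value] / p_b[value]
    return ratio

print(joint_likelihood_ratio(("short", 1)))  # (3/4)/(1/4) * (1/2)/(1/6) = 9
```

Nothing in the function cares whether the two dimensions share a distribution; it only relies on the joint probability factoring into a product of marginals.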
Hi Zack,
Can you clarify something? In the picture you draw, there is a codimension-1 linear subspace separating the parameter space into two halves, with all red points to one side, and all blue points to the other. Projecting onto any 1-dimensional subspace orthogonal to this (there is a unique one through the origin) will thus yield a 'variable' which cleanly separates the two points into the red and blue categories. So in the illustrated example, it looks just like a problem of bad coordinate choice.
On the other hand, one can easily have much more pathological situations; for example, the red points could all lie inside a certain sphere, and the blue points outside it. Then no choice of linear coordinates will illustrate this, and one has to use more advanced analysis techniques to pick up on it (e.g. persistent homology).
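A rough 2D sketch of that pathological case, using toy data I made up (red points on a circle of radius 1, blue points on a circle of radius 3). It does not use persistent homology; it only checks that every linear coordinate tried leaves the classes mixed, while a hand-picked radial coordinate separates them.

```python
import math

N = 16  # points per class; hypothetical toy data
angles = [2 * math.pi * k / N for k in range(N)]
red = [(math.cos(a), math.sin(a)) for a in angles]           # radius 1
blue = [(3 * math.cos(a), 3 * math.sin(a)) for a in angles]  # radius 3

def projected_range(points, theta):
    """Min and max of the points' coordinates along the direction at angle theta."""
    proj = [px * math.cos(theta) + py * math.sin(theta) for px, py in points]
    return min(proj), max(proj)

def overlaps(r1, r2):
    """True if two closed intervals intersect."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

# Every linear coordinate tried (one per degree) mixes the two classes...
directions = [math.pi * k / 180 for k in range(180)]
linear_fails = all(
    overlaps(projected_range(red, t), projected_range(blue, t)) for t in directions
)

# ...but the nonlinear coordinate r = sqrt(x^2 + y^2) separates them cleanly.
radial_works = max(math.hypot(*p) for p in red) < min(math.hypot(*p) for p in blue)

print(linear_fails, radial_works)  # True True
```

The radial coordinate only works here because I already knew the shape of the data; a method like persistent homology is meant for when you don't.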
So, to my vague question: do you have only the first situation in mind, or are you also considering the general case, but made the illustrated example extra-simple?
Perhaps this is clarified by your numerical example, I'm afraid I've not checked.
Projecting onto any 1-dimensional subspace orthogonal to this (there is a unique one through the origin) will thus yield a 'variable' which cleanly separates the two points into the red and blue categories. So in the illustrated example, it looks just like a problem of bad coordinate choice.
Thanks, this is a really important point! Indeed, for freely-reparametrizable abstract points in an abstract vector space, this is just a bad choice of coordinates. The reason this objection doesn't make the post completely useless, is that for some applications (you know, if you're one of those weird people who cares about "applications"), we do want to regard some bases as more "fundamental", if the variables represent real-world measurements.
For example, you might be able to successfully classify two different species of flower using both "stem length" and "petal color" measurements, even if the distributions overlap for either stem length or petal color considered individually. Mathematically, we could view the distributions as not overlapping with respect to some variable that corresponds to some weighted function of stem length and petal color, but that variable seems "artificial", less "interpretable."
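A toy version of the flower example, with invented numbers: `species_a`, `species_b`, and the numeric "petal hue" scale are all hypothetical. Each raw measurement overlaps between the two species, but the combined variable (petal hue minus stem length) does not.

```python
# Made-up (stem_length, petal_hue) measurements for two hypothetical species.
species_a = [(float(t), t + 1.0) for t in range(10)]
species_b = [(float(t), t - 1.0) for t in range(10)]

def value_range(points, axis):
    """Min and max of one measurement across a list of points."""
    vals = [p[axis] for p in points]
    return min(vals), max(vals)

def ranges_overlap(r1, r2):
    """True if two closed intervals intersect."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

# Each "natural" measurement overlaps when considered on its own.
stem_overlaps = ranges_overlap(value_range(species_a, 0), value_range(species_b, 0))
hue_overlaps = ranges_overlap(value_range(species_a, 1), value_range(species_b, 1))

# The "artificial" weighted combination separates the species cleanly.
combo_a = [hue - stem for stem, hue in species_a]
combo_b = [hue - stem for stem, hue in species_b]
combo_separates = min(combo_a) > max(combo_b)

print(stem_overlaps, hue_overlaps, combo_separates)  # True True True
```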
Another way to succinctly say this is that two distributions may be cleanly separable via a single immeasurable variable, but overlap when measured on any given measurable variable, such that a representation of the separation achieved by a single immeasurable variable is only achievable through multiple measurable variables.
Thanks for the reply, Zack.
The reason this objection doesn't make the post completely useless...
Sorry, I hope I didn't suggest I thought that! You make a good point about some variables being more natural in given applications. I think it's good to keep in mind that sometimes it's just a matter of coordinate choice, and other times the points may be separated but not in a linear way.
Sorry, I hope I didn't suggest I thought that!
I mean, it doesn't matter whether you think it, right? It matters whether it's true. Like, if I were to write a completely useless blog post on account of failing to understand the concept of a change of basis, then someone should tell me, because that would be helping me stop being deceived about the quality of my blogging.
FYI, one of the symbols in this post is not rendering properly. It appears to be U+20D7 COMBINING RIGHT ARROW ABOVE (appearing right after the ‘x’ characters) but, at least on this machine (Mac OS 10.11.6, Chrome 74.0.3729.131), it renders as a box.
It is probably a good idea to use LaTeX to encode such symbols.
UPDATE: It does work properly in Firefox 67.0.2 (on the same machine).
Thanks for the bug report; I edited the post to use LaTeX \vec{x}. (The combining arrow worked for me on Firefox 67.0.1 and was kind-of-ugly-but-definitely-renders on Chromium 74.0.3729.169, on Xubuntu 16.04.)
It is probably a good idea to use LaTeX to encode such symbols.
I've been doing this thing where I prefer to use "plain" Unicode where possible (where, e.g., the subscript in "x₁" is 0x2081 SUBSCRIPT ONE) and only resort to "fancy" (and therefore suspicious) LaTeX when I really need it, but the reported Chrome-on-macOS behavior does slightly alter my perception of "really need it."
I’ve been doing this thing where I prefer to use “plain” Unicode where possible
I entirely sympathize with this preference!
Unfortunately, proper rendering of Unicode depends on the availability of the requisite characters in the fallback fonts available in a user’s OS/client combination (which vary unpredictably). This means that the more exotic code points cannot be relied on to properly render with acceptable consistency.
Now, that having been said, and availability and proper rendering aside, I cannot endorse your use of such code points as U+2081 SUBSCRIPT ONE. Such typographic features as subscripts ought properly to be encoded via OpenType metadata [1], not via Unicode (and indeed I consider the existence of these code points to be a necessary evil at best, and possibly just a bad idea). In the case where OpenType metadata editing [2] is not available, the proper approach is either LaTeX, or “low-tech” approximations such as brackets.
[1] Which, in turn, ought to be generated programmatically from, e.g., HTML markup (or even higher-level markup languages like Markdown or wiki markup), rather than inserted manually. This is because the output generation code must be able to decide whether to use OpenType metadata or whether to instead use lower-level approaches like the HTML+CSS layout system, etc., depending on the capabilities of the output medium in any given case.
[2] That is, the editing of the requisite markup that will generate the proper OpenType metadata; see previous footnote.
I'm almost sure I saw a Wikipedia article about this back in the mid 2000s with a 2D version of your plot, but I can't find anything relevant in either https://en.wikipedia.org/wiki/List_of_fallacies#Statistical_fallacies or https://en.wikipedia.org/wiki/List_of_paradoxes#Statistics ... did I just dream of it?
(A standalone math post that I want to be able to link back to later/elsewhere)
There's this statistical phenomenon where it's possible for two multivariate distributions to overlap along any one variable, but be cleanly separable when you look at the entire configuration space at once. This is perhaps easiest to see with an illustrative diagram—
The denial of this possibility (in arguments of the form, "the distributions overlap along this variable, therefore you can't say that they're different") is sometimes called the "univariate fallacy." (Eliezer Yudkowsky proposes "covariance denial fallacy" or "cluster erasure fallacy" as potential alternative names.)
Let's make this more concrete by making up an example with actual numbers instead of just a pretty diagram. Imagine we have some datapoints that live in the forty-dimensional space {1, 2, 3, 4}⁴⁰ that are sampled from one of two probability distributions, which we'll call $P_A$ and $P_B$.
For simplicity, let's suppose that the individual variables x₁, x₂, ..., x₄₀ (the coördinates of a point in our forty-dimensional space) are statistically independent and identically distributed. For every individual $x_i$, the marginal distribution of $P_A$ is:
$$P_A(x_i) = \begin{cases} 1/4 & x_i = 1 \\ 7/16 & x_i = 2 \\ 1/4 & x_i = 3 \\ 1/16 & x_i = 4 \end{cases}$$
And for $P_B$:
$$P_B(x_i) = \begin{cases} 1/16 & x_i = 1 \\ 1/4 & x_i = 2 \\ 7/16 & x_i = 3 \\ 1/4 & x_i = 4 \end{cases}$$
If you look at any one $x_i$-coördinate for a point, you can't be confident which distribution the point was sampled from. For example, seeing that x₁ takes the value 2 gives you a 7/4 (= 1.75) likelihood ratio in favor of the point having been sampled from $P_A$ rather than $P_B$, which is log₂(7/4) ≈ 0.807 bits of evidence.
That's ... not a whole lot of evidence. If you guessed that the datapoint came from $P_A$ based on that much evidence, you'd be wrong about 4 times out of 10. (Given equal (1:1) prior odds, an odds ratio of 7:4 amounts to a probability of (7/4)/(1 + 7/4) ≈ 0.636.)
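If you want to check that arithmetic, here is a small sketch of the single-variable calculation, assuming equal prior odds as stated above. The variable names are mine.

```python
from fractions import Fraction
from math import log2

# Marginal probabilities of seeing the value 2 under each distribution.
p_a_of_2 = Fraction(7, 16)
p_b_of_2 = Fraction(1, 4)

likelihood_ratio = p_a_of_2 / p_b_of_2           # 7/4
bits = log2(float(likelihood_ratio))             # about 0.807 bits
posterior = likelihood_ratio / (1 + likelihood_ratio)  # 7/11, with 1:1 prior odds

print(likelihood_ratio, round(bits, 3), round(float(posterior), 3))  # 7/4 0.807 0.636
```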
And yet if we look at many variables, we can achieve supreme, godlike confidence about which distribution a point was sampled from. Proving this is left as an exercise to the particularly intrepid reader, but a concrete demonstration is probably simpler and should be pretty convincing! Let's write some Python code to sample a point $\vec{x}$ ∈ {1, 2, 3, 4}⁴⁰ from $P_A$:
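A minimal sketch of such a sampler, assuming the $P_A$ marginals given above and using the standard library's `random.choices`. The names `VALUES`, `WEIGHTS_A`, and `sample_a` are illustrative choices, and this is one way to do it rather than the only way.

```python
import random

# The four possible values and their P_A weights out of 16
# (that is, probabilities 1/4, 7/16, 1/4, 1/16).
VALUES = [1, 2, 3, 4]
WEIGHTS_A = [4, 7, 4, 1]

def sample_a(n=40):
    """Draw n independent coordinates from the P_A marginal."""
    return random.choices(VALUES, weights=WEIGHTS_A, k=n)

x = sample_a()
print(x)
```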
Go ahead and run the code yourself. (With an online REPL if you don't have Python installed locally.) You'll probably get a value of x that "looks something like" a list of forty integers, each between 1 and 4.
If someone off the street just handed you this $\vec{x}$ without telling you whether she got it from $P_A$ or $P_B$, how would you compute the probability that it came from $P_A$?
Well, because the coördinates/variables are statistically independent, you can just tally up (multiply) the individual likelihood ratios from each variable. That's only a little bit more code—
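One way to sketch that computation, assuming equal (1:1) prior odds and the marginals given above. The names `WEIGHTS_A`, `WEIGHTS_B`, `sample_a`, and `prob_a` are illustrative; the sampler is repeated so the snippet runs on its own.

```python
import random
from fractions import Fraction

# Marginals as weights out of 16 for the values 1, 2, 3, 4.
WEIGHTS_A = {1: 4, 2: 7, 3: 4, 4: 1}
WEIGHTS_B = {1: 1, 2: 4, 3: 7, 4: 4}

def sample_a(n=40):
    """Draw n independent coordinates from the P_A marginal."""
    values = list(WEIGHTS_A)
    return random.choices(values, weights=[WEIGHTS_A[v] for v in values], k=n)

def prob_a(x, prior_odds=Fraction(1)):
    """Posterior probability that x came from P_A, given prior odds A:B."""
    odds = prior_odds
    for xi in x:
        # Independence lets us multiply the per-coordinate likelihood ratios;
        # the common denominator of 16 cancels in each ratio.
        odds *= Fraction(WEIGHTS_A[xi], WEIGHTS_B[xi])
    return odds / (1 + odds)

x = sample_a()
print(float(prob_a(x)))
```

Using `Fraction` keeps the odds exact until the final conversion to a float, so small inputs can be checked by hand: `prob_a([2])` is exactly 7/11, and `prob_a([2, 3])` is exactly 1/2 because the two likelihood ratios cancel.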
If you run that code, you'll probably see "something like" a computed probability that $\vec{x}$ came from $P_A$ with several nines in it. Wow! That's pretty confident!
Thanks for reading!