Using fancy tools like neural nets, boosting and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a bandaid.
For some reason, statistics always seemed somewhat disjoint from the rest of math, more akin to a bunch of tools than a rigorous, carefully-constructed framework. I am here to atone for my foolishness.
This academic term started with a jolt - I quickly realized that I was missing quite a few prerequisites for the Bayesian Statistics course in which I had enrolled, and that good ol' AP Stats wasn't gonna cut it. I threw myself at All of Statistics, doing a good number of exercises, dissolving confusion wherever I could find it, and making sure I could turn each concept around and make sense of it from multiple perspectives.
I then went even further, challenging myself during the bits of downtime throughout my day to do things like explain variance from first principles, starting from the sample space, walking through random variables and expectation - without help.
All of Statistics
In which sample spaces are formalized.
3: Random Variables
In which random variables are detailed and a multitude of distributions are introduced.
Consider that a random variable is a function $X : \Omega \to \mathbb{R}$. For random variables $X, Y$, we can then produce the conditional random variable $X \mid Y$, with

$$\mathbb{E}_Y\big[\mathbb{E}[X \mid Y]\big] = \mathbb{E}[X].$$

This is conservation of expected evidence (thanks to Alex Mennen for making this connection explicit).
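As a quick sanity check, the tower property can be computed exactly on a small joint pmf. The particular table below is made up for illustration:

```python
import numpy as np

# Made-up joint pmf of (X, Y): rows index x-values, columns index y-values.
x_vals = np.array([0.0, 1.0, 4.0])
joint = np.array([[0.10, 0.20],
                  [0.30, 0.15],
                  [0.05, 0.20]])          # entries sum to 1

p_y = joint.sum(axis=0)                   # marginal pmf of Y
# E[X | Y = y] = sum_x x * p(x, y) / p(y), one value per column.
e_x_given_y = (x_vals @ joint) / p_y

# Tower property: averaging E[X | Y] over Y recovers E[X].
lhs = e_x_given_y @ p_y                   # E_Y[ E[X | Y] ]
rhs = x_vals @ joint.sum(axis=1)          # E[X] from the marginal of X
print(lhs, rhs)                           # both 1.45
```

Both sides collapse to the same double sum over the joint table, which is why the identity holds for any pmf, not just this one.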
$$\operatorname{Var}(X) = \mathbb{E}[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(\mathbb{E}[X \mid Y])$$

Why does marginal variance have two terms? Shouldn't the expected conditional variance be sufficient?
This literally plagued my dreams.
Proof (of the variance; I cannot prove it plagued my dreams):
$$
\begin{aligned}
\operatorname{Var}(X) &= \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] \\
&= \mathbb{E}\Big[\big((X - \mathbb{E}[X \mid Y]) + (\mathbb{E}[X \mid Y] - \mathbb{E}[X])\big)^2\Big] \\
&= \mathbb{E}\big[(X - \mathbb{E}[X \mid Y])^2\big] + 2\,\mathbb{E}\big[(X - \mathbb{E}[X \mid Y])(\mathbb{E}[X \mid Y] - \mathbb{E}[X])\big] + \mathbb{E}\big[(\mathbb{E}[X \mid Y] - \mathbb{E}[X])^2\big] \\
&= \mathbb{E}[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(\mathbb{E}[X \mid Y]).
\end{aligned}
$$

The middle term is eliminated as the expectations cancel out after repeated applications of conservation of expected evidence. Another way to look at the last two terms is as the sum of the expected conditional variance and the variance of the conditional expectation.
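The decomposition can also be verified exactly on a toy discrete joint distribution. The pmf below is arbitrary; the identity holds for any choice:

```python
import numpy as np

# Made-up joint pmf of (X, Y): rows index x-values, columns index y-values.
x_vals = np.array([0.0, 1.0, 4.0])
joint = np.array([[0.10, 0.20],
                  [0.30, 0.15],
                  [0.05, 0.20]])

p_y = joint.sum(axis=0)                                # marginal of Y
p_x = joint.sum(axis=1)                                # marginal of X

e_x = x_vals @ p_x
var_x = (x_vals - e_x) ** 2 @ p_x                      # marginal variance

cond = joint / p_y                                     # p(x | y), column-wise
e_x_given_y = x_vals @ cond                            # E[X | Y = y]
var_x_given_y = ((x_vals[:, None] - e_x_given_y) ** 2 * cond).sum(axis=0)

expected_cond_var = var_x_given_y @ p_y                # E[ Var(X | Y) ]
var_of_cond_mean = (e_x_given_y - e_x) ** 2 @ p_y      # Var( E[X | Y] )

print(var_x, expected_cond_var + var_of_cond_mean)     # equal
```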
When calculating variance from observations $X_1, \dots, X_n$, you might think to write

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2,$$

where $\bar{X}_n$ is the sample mean. However, this systematically underestimates the true variance: the deviations are measured from the sample mean, which itself chases the data, rather than from the true mean. The corrected sample variance is thus

$$S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
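A quick simulation makes the bias concrete. The sketch below (NumPy, arbitrary normal data with true variance 4, small samples of $n = 5$) compares dividing by $n$ against dividing by $n - 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5                        # small samples make the bias obvious
trials = 200_000

# Draw many samples from N(0, 2^2), so the true variance is 4.
samples = rng.normal(0.0, 2.0, size=(trials, n))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)   # sum of squared deviations

naive = (ss / n).mean()                    # divide by n: biased low
corrected = (ss / (n - 1)).mean()          # divide by n-1: unbiased

print(naive, corrected)   # naive ≈ (n-1)/n · 4 = 3.2, corrected ≈ 4.0
```

The naive estimator lands near $\frac{n-1}{n} \sigma^2$, exactly the factor the $n - 1$ correction undoes.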
In which the author provides instrumentally-useful convergence results; namely, the law of large numbers and the central limit theorem.
Equality of Continuous Variables
For continuous random variables $X, Y$, we have $P(X = Y) = 0$, which is surprising. In fact, for any $x \in \mathbb{R}$, $P(X = x) = 0$ as well!
The continuity is the culprit. Since the cumulative distribution functions are continuous, the probability allotted to any given point is 0 in the limit. Read more here.
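A toy simulation with two independent Uniform(0, 1) draws illustrates this: exact equality (essentially) never occurs, and the probability of *near*-equality vanishes as the tolerance shrinks. For uniforms, $P(|X - Y| < \delta) = 2\delta - \delta^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(size=n)
y = rng.uniform(size=n)

# Exact equality between two independent continuous draws essentially
# never happens, even with a million pairs of floats:
exact = np.mean(x == y)
print("P(X = Y) ~", exact)

# The mass near equality shrinks linearly with the tolerance delta:
for delta in (0.1, 0.01, 0.001):
    print(delta, np.mean(np.abs(x - y) < delta))   # ~ 2*delta - delta**2
```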
Types of Convergence
Let $X_1, X_2, \dots$ be a sequence of random variables, and let $X$ be another random variable. Let $F_n$ denote the CDF of $X_n$, and let $F$ denote the CDF of $X$.
In Probability

$X_n$ converges to $X$ in probability, written $X_n \xrightarrow{P} X$, if, for every $\epsilon > 0$, $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$.
Random variables are functions $X : \Omega \to \mathbb{R}$, assigning a number to each possible outcome $\omega$ in the sample space $\Omega$. Considering this fact, two random variables converge in probability when their assigned values are "far apart" (greater than $\epsilon$) with probability approaching 0 in the limit.
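One way to watch the definition in action: take a toy sequence $X_n = X + \text{noise of scale } 1/n$ (my own made-up example) and estimate the deviation probability by simulation as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
trials = 100_000

# Toy sequence: X_n = X + noise whose scale shrinks like 1/n.
probs = []
for n in (1, 10, 100, 1000):
    x = rng.normal(size=trials)                          # draws of X
    x_n = x + rng.normal(scale=1.0 / n, size=trials)     # draws of X_n
    probs.append(np.mean(np.abs(x_n - x) > eps))         # P(|X_n - X| > eps)

print(probs)   # shrinks toward 0 as n grows
```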
In Distribution

$X_n$ converges to $X$ in distribution, written $X_n \rightsquigarrow X$, if $\lim_{n \to \infty} F_n(t) = F(t)$ at all $t$ for which $F$ is continuous.
A similar geometric intuition:
Note: the continuity requirement is important. Imagine we distribute points uniformly on $(0, \frac{1}{n})$; we see that $X_n \rightsquigarrow X$, where $X$ is the point mass with $P(X = 0) = 1$. However, $F_n(0)$ is 0 for every $n$, but $F(0) = 1$. Thus CDF convergence does not occur at $t = 0$.
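This example can be checked directly from the closed-form CDFs; the helper `F_n` below is mine, not the book's:

```python
import numpy as np

def F_n(t, n):
    """CDF of X_n ~ Uniform(0, 1/n): 0 below 0, then rises linearly to 1."""
    return np.clip(t * n, 0.0, 1.0)

# The limit X places all its mass at 0, so F(t) = 1 for t >= 0 and 0 below.
for n in (1, 10, 100, 1000):
    print(n, F_n(0.0, n), F_n(0.05, n))

# F_n(0) stays 0 even though F(0) = 1: no convergence at the discontinuity
# point t = 0. At any fixed t > 0, F_n(t) -> 1 = F(t), as required.
```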
In Quadratic Mean
$X_n$ converges to $X$ in quadratic mean, written $X_n \xrightarrow{qm} X$, if $\mathbb{E}\big[(X_n - X)^2\big] \to 0$ as $n \to \infty$.
The expected squared distance between $X_n$ and $X$ approaches 0; in contrast to convergence in probability, dealing with expectation means that values of $X_n$ highly deviant with respect to