Using fancy tools like neural nets, boosting and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a bandaid.
Larry Wasserman
Foreword
For some reason, statistics always seemed somewhat disjoint from the rest of math, more akin to a bunch of tools than a rigorous, carefully-constructed framework. I am here to atone for my foolishness.
This academic term started with a jolt - I quickly realized that I was missing quite a few prerequisites for the Bayesian Statistics course in which I had enrolled, and that good ol' AP Stats wasn't gonna cut it. I threw myself at All of Statistics, doing a good number of exercises, dissolving confusion wherever I could find it, and making sure I could turn each concept around and make sense of it from multiple perspectives.
I then went even further, challenging myself during the bits of downtime throughout my day to do things like explain variance from first principles, starting from the sample space, walking through random variables and expectation - without help.
All of Statistics
1: Introduction
2: Probability
In which sample spaces are formalized.
3: Random Variables
In which random variables are detailed and a multitude of distributions are introduced.
Conjugate Variables
Consider that a random variable $X$ is a function $X:\Omega \to \mathbb{R}$. For random variables $X, Y$, we can then produce conjugate random variables $XY, X+Y$, with $(XY)(\omega) = X(\omega)\,Y(\omega)$ and $(X+Y)(\omega) = X(\omega) + Y(\omega)$.
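To make the "random variables are functions" view concrete, here is a minimal sketch; the two-coin sample space and the particular $X$ and $Y$ are my own illustrative choices, not anything from the book.

```python
# Random variables as functions on a finite sample space Omega.
# Omega: two coin flips. X counts heads; Y indicates "first flip is heads".
omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

def X(w):
    return sum(1 for flip in w if flip == "H")  # number of heads

def Y(w):
    return 1 if w[0] == "H" else 0  # indicator of heads on the first flip

# The conjugate variables X+Y and XY are again functions on Omega, defined pointwise.
def X_plus_Y(w):
    return X(w) + Y(w)

def X_times_Y(w):
    return X(w) * Y(w)

for w in omega:
    print(w, X(w), Y(w), X_plus_Y(w), X_times_Y(w))
```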
4: Expectation
Evidence Preservation
The iterated expectation rule $E[E(Y \mid X)] = E(Y)$ is conservation of expected evidence (thanks to Alex Mennen for making this connection explicit).
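A quick Monte Carlo sanity check of the identity; the model ($X \sim \text{Uniform}(0,1)$, $Y \mid X \sim \mathcal{N}(X, 1)$, so $E(Y \mid X) = X$) is an arbitrary choice for illustration.

```python
# Monte Carlo check of E[E(Y|X)] = E(Y).
# Illustrative model: X ~ Uniform(0,1), Y | X ~ Normal(X, 1), so E(Y | X) = X
# and both estimates below should be close to E(X) = 0.5.
import random

random.seed(0)
n = 100_000
xs = [random.random() for _ in range(n)]
ys = [random.gauss(x, 1.0) for x in xs]

print(f"E[E(Y|X)] ~ {sum(xs) / n:.4f}")  # average of E(Y|X) = X over the samples
print(f"E[Y]      ~ {sum(ys) / n:.4f}")  # average of Y itself
```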
Marginal Variance

$$V(Y) = E\big[V(Y \mid X)\big] + V\big(E(Y \mid X)\big)$$

This literally plagued my dreams.
Proof (of the variance; I cannot prove it plagued my dreams):
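One way to carry out the expansion (a sketch of the usual add-and-subtract argument):

$$
\begin{aligned}
V(Y) &= E\Big[\big(Y - E(Y)\big)^2\Big] \\
&= E\Big[\big([Y - E(Y \mid X)] + [E(Y \mid X) - E(Y)]\big)^2\Big] \\
&= E\Big[\big(Y - E(Y \mid X)\big)^2\Big] + 2\,E\Big[\big(Y - E(Y \mid X)\big)\big(E(Y \mid X) - E(Y)\big)\Big] + E\Big[\big(E(Y \mid X) - E(Y)\big)^2\Big] \\
&= E\big[V(Y \mid X)\big] + V\big(E(Y \mid X)\big).
\end{aligned}
$$

The first term collapses by the tower property, $E\big[(Y - E(Y \mid X))^2\big] = E\big[E\big((Y - E(Y \mid X))^2 \mid X\big)\big] = E[V(Y \mid X)]$, and the third is exactly $V(E(Y \mid X))$ since $E[E(Y \mid X)] = E(Y)$.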
The middle term is eliminated as the expectations cancel out after repeated applications of conservation of expected evidence: conditioning on $X$ gives $E\big[(E(Y \mid X) - E(Y))\,E\big(Y - E(Y \mid X) \mid X\big)\big] = 0$. Another way to look at the last two terms is as the sum of the expected conditional variance and the variance of the conditional expectation.
Bessel's Correction
When calculating variance from observations $X_1, \dots, X_n$, you might think to write

$$S_n^2 = \frac{1}{n}\sum_{i=1}^{n}\big(X_i - \bar{X}_n\big)^2,$$

where $\bar{X}_n$ is the sample mean. However, this systematically underestimates the true variance: the deviations are measured from $\bar{X}_n$, which is itself fit to the very same data, rather than from the true mean, so the estimator is biased downward. The corrected (unbiased) sample variance is thus

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \bar{X}_n\big)^2.$$
See Wikipedia.
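A quick simulation of the bias (the sample size, number of trials, and the standard-normal population are arbitrary illustrative choices):

```python
# Compare the 1/n and 1/(n-1) variance estimators on many small samples.
# Population: standard normal, so the true variance is 1.
import random

random.seed(0)
n, trials = 5, 20_000
biased_total, corrected_total = 0.0, 0.0

for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    biased_total += ss / n           # uncorrected estimator
    corrected_total += ss / (n - 1)  # Bessel-corrected estimator

print(f"mean of 1/n estimator:     {biased_total / trials:.3f}")     # ~ (n-1)/n = 0.8
print(f"mean of 1/(n-1) estimator: {corrected_total / trials:.3f}")  # ~ 1.0
```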
5: Inequalities
6: Convergence
In which the author provides instrumentally-useful convergence results; namely, the law of large numbers and the central limit theorem.
Equality of Continuous Variables
For continuous random variables $X, Y$, we have $P(X = Y) = 0$, which is surprising. In fact, for samples $x_i \sim X$, $y_i \sim Y$, $P(x_i = y_i) = 0$ as well!
The continuity is the culprit. Since the cumulative distribution functions $F_X, F_Y$ are continuous, the probability allotted to any single point is 0. Read more here.
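One way to see the first claim, assuming $X$ and $Y$ have a joint density $f$ (an assumption beyond what is stated above):

$$P(X = Y) = \iint_{\{(x, y)\,:\,x = y\}} f(x, y)\,dx\,dy = 0,$$

since the diagonal is a line in the plane and so has zero area; integrating a density over it contributes nothing.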
Types of Convergence

Let $X_1, X_2, \dots$ be a sequence of random variables, and let $X$ be another random variable. Let $F_n$ denote the CDF of $X_n$, and let $F$ denote the CDF of $X$.
In Probability

$X_n$ converges to $X$ in probability, written $X_n \xrightarrow{P} X$, if, for every $\epsilon > 0$, $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$.
Random variables are functions $Y : \Omega \to \mathbb{R}$, assigning a number to each possible outcome in the sample space $\Omega$. Considering this fact, $X_n$ converges to $X$ in probability when their assigned values are "far apart" (differ by more than $\epsilon$) with probability tending to 0 in the limit.
See here.
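A small simulation of the idea (my illustration: $X_n$ is the mean of $n$ Uniform(0,1) draws and $X = 1/2$, with $\epsilon$ and the sample sizes chosen arbitrarily):

```python
# Estimate P(|X_n - 1/2| > eps) for growing n, where X_n is the mean of n Uniform(0,1) draws.
# The estimated probabilities should shrink toward 0 as n grows.
import random

random.seed(0)
eps, trials = 0.05, 2_000

for n in [10, 100, 1000, 5000]:
    exceed = 0
    for _ in range(trials):
        xbar = sum(random.random() for _ in range(n)) / n
        if abs(xbar - 0.5) > eps:
            exceed += 1
    print(f"n={n:>5}: P(|X_n - 1/2| > {eps}) ~ {exceed / trials:.3f}")
```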
In Distribution

$X_n$ converges to $X$ in distribution, written $X_n \rightsquigarrow X$, if $\lim_{n \to \infty} F_n(t) = F(t)$ at all $t$ for which $F$ is continuous.
Fairly straightforward.
A similar geometric intuition:
Note: the continuity requirement is important. Imagine $X_n$ is distributed uniformly on $(0, 1/n)$; we see that $X_n \rightsquigarrow 0$ (the point mass at 0). However, $F_n(x) = 0$ for $x \le 0$, but $F(0) = 1$, so CDF convergence does not occur at $x = 0$. Since $F$ is discontinuous there, that point is exempted and convergence in distribution still holds.
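Working the CDFs out explicitly for this example:

$$F_n(t) = \begin{cases} 0 & t \le 0 \\ nt & 0 < t < 1/n \\ 1 & t \ge 1/n \end{cases}
\qquad
F(t) = \begin{cases} 0 & t < 0 \\ 1 & t \ge 0. \end{cases}$$

For any $t > 0$, $F_n(t) = 1$ once $n > 1/t$, and for $t < 0$, $F_n(t) = 0$ for all $n$; so $F_n(t) \to F(t)$ at every $t \ne 0$, which is exactly every point where $F$ is continuous.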
In Quadratic Mean

$X_n$ converges to $X$ in quadratic mean, written $X_n \xrightarrow{\text{qm}} X$, if $E\big((X_n - X)^2\big) \to 0$ as $n \to \infty$.
The expected squared distance between $X_n$ and $X$ approaches 0; in contrast to convergence in probability, dealing with expectation means that values of $X_n$ highly deviant with respect to $X$ still matter, weighted by their squared distance, no matter how improbable they become.
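A standard counterexample that makes the contrast concrete (my choice of example, not necessarily the book's): let $X_n = n$ with probability $1/n$ and $X_n = 0$ otherwise. Then for any $\epsilon > 0$,

$$P(|X_n - 0| > \epsilon) \le \frac{1}{n} \to 0, \qquad\text{but}\qquad E\big((X_n - 0)^2\big) = n^2 \cdot \frac{1}{n} = n \to \infty,$$

so $X_n \xrightarrow{P} 0$ while $X_n$ does not converge to 0 in quadratic mean.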