Using fancy tools like neural nets, boosting and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a bandaid.

Larry Wasserman

Foreword

For some reason, statistics always seemed somewhat disjoint from the rest of math, more akin to a bunch of tools than a rigorous, carefully-constructed framework. I am here to atone for my foolishness.

This academic term started with a jolt - I quickly realized that I was missing quite a few prerequisites for the Bayesian Statistics course in which I had enrolled, and that good ol' AP Stats wasn't gonna cut it. I threw myself at All of Statistics, doing a good number of exercises, dissolving confusion wherever I could find it, and making sure I could turn each concept around and make sense of it from multiple perspectives.

I then went even further, challenging myself during the bits of downtime throughout my day to do things like explain variance from first principles, starting from the sample space, walking through random variables and expectation - without help.

All of Statistics

1: Introduction

2: Probability

In which sample spaces are formalized.

3: Random Variables

In which random variables are detailed and a multitude of distributions are introduced.

Conjugate Variables

Consider that a random variable $X$ is a function $X: \Omega \to \mathbb{R}$. For random variables $X, Y$, we can then produce conjugate random variables $XY$ and $X + Y$, with $(XY)(\omega) = X(\omega)\,Y(\omega)$ and $(X + Y)(\omega) = X(\omega) + Y(\omega)$ for each outcome $\omega \in \Omega$.
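The pointwise construction can be sketched concretely. Below is a minimal illustration (the two-coin sample space and uniform measure are hypothetical choices, not from the book): random variables are ordinary functions on $\Omega$, and $X + Y$ and $XY$ are built by combining their values outcome by outcome.

```python
# A toy sample space for two fair coin flips (hypothetical example).
omega = [(a, b) for a in (0, 1) for b in (0, 1)]

def X(w):  # value of the first flip
    return w[0]

def Y(w):  # value of the second flip
    return w[1]

# "Conjugate" variables are built pointwise: each maps an outcome to a number.
def X_plus_Y(w):
    return X(w) + Y(w)

def X_times_Y(w):
    return X(w) * Y(w)

# Expectation under the uniform measure on omega.
def E(Z):
    return sum(Z(w) for w in omega) / len(omega)

print(E(X_plus_Y))   # 1.0
print(E(X_times_Y))  # 0.25
```

Note that $E(X + Y) = E(X) + E(Y)$ holds here, while $E(XY) = E(X)E(Y)$ holds only because the flips are independent.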

4: Expectation

Evidence Preservation

$$E\left[E[Y \mid X]\right] = E[Y]$$

is conservation of expected evidence (thanks to Alex Mennen for making this connection explicit).

Marginal Variance

$$\text{Var}(Y) = E\left[\text{Var}(Y \mid X)\right] + \text{Var}\left(E[Y \mid X]\right)$$

This literally plagued my dreams.

Proof (of the variance; I cannot prove it plagued my dreams): expand $\text{Var}(Y) = E[Y^2] - E[Y]^2$ by conditioning on $X$. The middle term is eliminated as the expectations cancel out after repeated applications of conservation of expected evidence. Another way to look at the last two terms is as the sum of the expected conditional variance and the variance of the conditional expectation.
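The decomposition can be checked numerically. The hierarchical model below is a hypothetical choice for illustration: draw $X$ uniformly from $\{1, 2, 3\}$, then $Y \mid X \sim \mathcal{N}(X, X^2)$, and compare the total variance against the within-group plus between-group pieces.

```python
import random
import statistics

random.seed(0)

# Hypothetical hierarchical model: X ~ Uniform{1,2,3}, Y | X ~ Normal(X, sd=X).
N = 100_000
xs = [random.choice([1, 2, 3]) for _ in range(N)]
ys = [random.gauss(x, x) for x in xs]

var_y = statistics.pvariance(ys)  # total variance of Y

# Group the Y draws by their X value to estimate conditional moments.
groups = {x: [y for x2, y in zip(xs, ys) if x2 == x] for x in (1, 2, 3)}
cond_means = {x: statistics.fmean(g) for x, g in groups.items()}
cond_vars = {x: statistics.pvariance(g) for x, g in groups.items()}
weights = {x: len(groups[x]) / N for x in groups}

e_var = sum(weights[x] * cond_vars[x] for x in groups)     # E[Var(Y|X)]
var_e = statistics.pvariance([cond_means[x] for x in xs])  # Var(E[Y|X])

# The within/between split reproduces the total variance exactly on the sample.
print(var_y, e_var + var_e)
```

For this model the theoretical values are $E[\text{Var}(Y \mid X)] = E[X^2] = 14/3$ and $\text{Var}(E[Y \mid X]) = \text{Var}(X) = 2/3$, so both printed numbers should sit near $16/3 \approx 5.33$.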

Bessel's Correction

When calculating variance from observations X1,…,Xn, you might think to write

$$S_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2,$$

where $\bar{X}_n$ is the sample mean. However, this estimator systematically underestimates the true variance $\sigma^2$: the sample mean is computed from the same data, so the $X_i$ are on average closer to $\bar{X}_n$ than to the true mean. The corrected (unbiased) sample variance is thus

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2.$$

See Wikipedia.
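A quick simulation makes the bias visible (a minimal sketch; the standard-normal population and sample size are arbitrary choices): averaging the two estimators over many small samples, the $1/n$ version lands near $(n-1)/n \cdot \sigma^2$ while Bessel's correction recovers $\sigma^2$.

```python
import random
import statistics

random.seed(1)

# True distribution: Normal(0, 1), so the population variance is 1.
# Repeatedly draw small samples and average the two variance estimators.
n, trials = 5, 100_000
biased, corrected = 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    xbar = statistics.fmean(xs)
    ss = sum((x - xbar) ** 2 for x in xs)
    biased += ss / n            # divides by n: biased low
    corrected += ss / (n - 1)   # Bessel's correction: divides by n - 1

print(biased / trials)     # ≈ (n-1)/n = 0.8, systematically below 1
print(corrected / trials)  # ≈ 1.0
```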

5: Inequalities

6: Convergence

In which the author provides instrumentally-useful convergence results; namely, the law of large numbers and the central limit theorem.

Equality of Continuous Variables

For continuous random variables $X, Y$, we have $P(X = Y) = 0$, which is surprising. In fact, for draws $x_i \sim X$ and $y_i \sim Y$, $P(x_i = y_i) = 0$ as well!

The continuity is the culprit. Since the cumulative distribution functions $F_X, F_Y$ are continuous, the probability allotted to any given point is 0 in the limit. Read more here.
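This is easy to witness empirically (a sketch; the uniform and normal distributions are arbitrary continuous choices): drawing a large number of pairs from two continuous distributions yields no exact ties.

```python
import random

random.seed(2)

# Draw many (x, y) pairs from two continuous distributions; count exact ties.
pairs = ((random.random(), random.gauss(0, 1)) for _ in range(1_000_000))
ties = sum(1 for x, y in pairs if x == y)
print(ties)  # 0 — exact equality of continuous draws essentially never occurs
```

(Floating-point draws are only finitely precise, so a tie is not literally impossible on a computer, but its probability is astronomically small.)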

Types of Convergence

Let $X_1, X_2, \ldots$ be a sequence of random variables, and let $X$ be another random variable. Let $F_n$ denote the CDF of $X_n$, and let $F$ denote the CDF of $X$.

In Probability

$X_n$ converges to $X$ in probability, written $X_n \xrightarrow{P} X$, if, for every $\epsilon > 0$, $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$.

Random variables are functions $Y: \Omega \to \mathbb{R}$, assigning a number to each possible outcome in the sample space $\Omega$. In this light, $X_n$ converges to $X$ in probability when the probability that their assigned values are "far apart" (differ by more than $\epsilon$) vanishes in the limit. See here.
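The classic instance is the sample mean of coin flips converging to $1/2$ (the weak law of large numbers). A small simulation (the tolerance $\epsilon = 0.05$ and trial counts are arbitrary choices) estimates $P(|X_n - \tfrac{1}{2}| > \epsilon)$ for growing $n$:

```python
import random

random.seed(3)

# X_n = sample mean of n fair coin flips; the limit X is the constant 1/2.
# Estimate P(|X_n - 1/2| > eps) for growing n — it should shrink toward 0.
eps, trials = 0.05, 2_000

def prob_far(n):
    far = 0
    for _ in range(trials):
        xbar = sum(random.random() < 0.5 for _ in range(n)) / n
        if abs(xbar - 0.5) > eps:
            far += 1
    return far / trials

for n in (10, 100, 1000):
    print(n, prob_far(n))
```

The printed probabilities fall steeply with $n$, matching the definition.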

In Distribution

$X_n$ converges to $X$ in distribution, written $X_n \rightsquigarrow X$, if $\lim_{n \to \infty} F_n(t) = F(t)$ at all $t$ for which $F$ is continuous.

Fairly straightforward.

A similar geometric intuition applies.

Note: the continuity requirement is important. Imagine we distribute points uniformly on $(0, \frac{1}{n})$; we see that $X_n \rightsquigarrow 0$. However, $F_n(x)$ is 0 when $x \leq 0$, but $F(0) = 1$. Thus CDF convergence does not occur at $x = 0$.
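The CDFs in this example can be written down exactly, which makes the failure at the discontinuity concrete (a sketch of the closed forms, nothing simulated):

```python
# X_n ~ Uniform(0, 1/n) has CDF F_n(t) = clamp(n*t, 0, 1); the limit X = 0
# (a point mass at zero) has CDF F(t) = 0 for t < 0 and 1 for t >= 0.
def F_n(n, t):
    return min(max(n * t, 0.0), 1.0)

def F(t):
    return 1.0 if t >= 0 else 0.0

# At any continuity point of F (t != 0), F_n(t) -> F(t):
print([F_n(n, 0.25) for n in (1, 4, 100)])  # [0.25, 1.0, 1.0] -> F(0.25) = 1
print([F_n(n, -0.1) for n in (1, 4, 100)])  # [0.0, 0.0, 0.0] -> F(-0.1) = 0
# But at the discontinuity t = 0, F_n(0) = 0 for every n, while F(0) = 1.
print(F_n(10**9, 0.0), F(0.0))  # 0.0 1.0
```

Excluding discontinuity points of $F$ from the definition is exactly what rescues this example.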

In Quadratic Mean

$X_n$ converges to $X$ in quadratic mean, written $X_n \xrightarrow{qm} X$, if $E\left((X_n - X)^2\right) \to 0$ as $n \to \infty$.

The expected squared distance between $X_n$ and $X$ approaches 0. In contrast to convergence in probability, taking an expectation means that values of $X_n$ highly deviant with respect to $X$ are weighted by their squared magnitude, so even rare, extreme deviations can prevent convergence in quadratic mean.
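The textbook counterexample makes the gap between the two modes concrete (my choice of example, not necessarily the book's): let $X_n = n$ with probability $1/n$ and $0$ otherwise. Then $X_n \xrightarrow{P} 0$, yet $E[X_n^2] = n^2 \cdot \frac{1}{n} = n \to \infty$, so $X_n$ does not converge to 0 in quadratic mean.

```python
import random

random.seed(4)

# X_n = n with probability 1/n, else 0. Then X_n -> 0 in probability,
# but E[(X_n - 0)^2] = n^2 * (1/n) = n -> infinity: no qm convergence.
def p_nonzero(n, trials=100_000):
    """Estimate P(X_n != 0) by simulation."""
    return sum(random.random() < 1 / n for _ in range(trials)) / trials

def second_moment(n):
    return n ** 2 / n  # exact: E[X_n^2] = n^2 * (1/n) = n

for n in (10, 100, 1000):
    print(n, p_nonzero(n), second_moment(n))
```

The deviation probability shrinks like $1/n$, but each rare deviation is of size $n$, and squaring it overwhelms its rarity.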
