Ethan (EJ) Watkins
Comments

Six (and a half) intuitions for KL divergence
Ethan (EJ) Watkins · 2y

> By the law of large numbers, $\frac{1}{N}\sum_{i=1}^{N} \ln Q_\theta(x_i) \to \sum_x P(x) \ln Q_\theta(x)$ almost surely. This is the cross entropy of $P$ and $Q_\theta$. Also note that if we subtract this from the entropy of $P$, we get $D_{KL}(P||Q_\theta)$. So minimising the cross entropy over $\theta$ is equivalent to maximising $D_{KL}(P||Q_\theta)$.

I think the cross entropy of $P$ and $Q_\theta$ is actually $H(P, Q_\theta) = -\sum_x P(x) \ln Q_\theta(x)$ (note the negative sign). The entropy of $P$ is $H(P) = -\sum_x P(x) \ln P(x)$. Since
$$D_{KL}(P||Q_\theta) = \sum_x P(x)\big(\ln P(x) - \ln Q_\theta(x)\big) = \sum_x P(x) \ln P(x) - \sum_x P(x) \ln Q_\theta(x) = -H(P) + H(P, Q_\theta),$$
the KL divergence is actually the cross entropy minus the entropy, not the other way around. So minimising the cross entropy over $\theta$ will minimise (not maximise) the KL divergence.
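Here's a quick numerical sanity check of that identity, using a made-up three-outcome distribution (my own toy example, nothing from the post):

```python
import numpy as np

# Toy distributions P and Q over three outcomes (hypothetical example)
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

entropy_P = -np.sum(P * np.log(P))        # H(P)
cross_entropy = -np.sum(P * np.log(Q))    # H(P, Q)
kl = np.sum(P * (np.log(P) - np.log(Q)))  # D_KL(P || Q)

# D_KL(P || Q) = H(P, Q) - H(P): cross entropy minus entropy
assert np.isclose(kl, cross_entropy - entropy_P)
print(kl, cross_entropy - entropy_P)
```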

I believe the next paragraph is still correct: the maximum likelihood estimator $\theta^*$ is the parameter which maximises $L(\hat{P}_n; Q_\theta)$, which minimises the cross-entropy, which minimises the KL divergence.
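As a small illustration of that chain of equivalences, here is a sketch with a Bernoulli model (again my own example, not from the post): the $\theta$ that maximises the average log-likelihood on samples from $P$ is essentially the same $\theta$ that minimises the cross entropy $H(P, Q_\theta)$.

```python
import numpy as np

# Sketch: samples from a "true" Bernoulli(0.7) distribution P,
# model family Q_theta = Bernoulli(theta).
rng = np.random.default_rng(0)
p_true = 0.7
x = rng.random(10_000) < p_true  # samples x_i ~ P

thetas = np.linspace(0.01, 0.99, 99)
# Average log-likelihood (1/N) sum_i ln Q_theta(x_i)
avg_loglik = np.array(
    [np.mean(np.where(x, np.log(t), np.log(1 - t))) for t in thetas]
)
# Cross entropy H(P, Q_theta) = -sum_x P(x) ln Q_theta(x)
cross_ent = -(p_true * np.log(thetas) + (1 - p_true) * np.log(1 - thetas))

# The likelihood-maximising theta and the cross-entropy-minimising theta
# agree up to sampling noise, both landing near the true parameter 0.7.
print(thetas[np.argmax(avg_loglik)], thetas[np.argmin(cross_ent)])
```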

Apologies if any of what I've said above is incorrect; I'm not an expert on this.

Six (and a half) intuitions for KL divergence
Ethan (EJ) Watkins · 2y

> $D_{KL}(P||Q) = \sum_x p_x (\ln p_x - \ln q_x) = \mathbb{E}[I_P(X) - I_Q(X)]$

I think there is a mistake in this equation. $I_P(X)$ and $I_Q(X)$ are the wrong way round. It should be:
$$D_{KL}(P||Q) = \sum_x p_x (\ln p_x - \ln q_x) = \mathbb{E}[I_Q(X) - I_P(X)]$$
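To spell out why the order flips (assuming the post's surprisal notation $I_P(x) = -\ln p_x$ and $I_Q(x) = -\ln q_x$):
$$\mathbb{E}_{X \sim P}[I_Q(X) - I_P(X)] = \sum_x p_x \big(-\ln q_x + \ln p_x\big) = \sum_x p_x (\ln p_x - \ln q_x) = D_{KL}(P||Q).$$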
 
