Comments

By the law of large numbers, $\frac{1}{n}\sum_{i=1}^n \log q_\theta(x_i) \to \mathbb{E}_{x \sim p}[\log q_\theta(x)]$ almost surely. This is the cross entropy of $p$ and $q_\theta$. Also note that if we subtract this from the entropy of $p$, we get $D_{KL}(p \,\|\, q_\theta)$. So minimising the cross entropy over $\theta$ is equivalent to maximising $D_{KL}(p \,\|\, q_\theta)$.

I think the cross entropy of $p$ and $q_\theta$ is actually $-\mathbb{E}_{x \sim p}[\log q_\theta(x)]$ (note the negative sign). The entropy of $p$ is $-\mathbb{E}_{x \sim p}[\log p(x)]$. The KL divergence is then the cross entropy minus the entropy, not the other way around. So minimising the cross entropy over $\theta$ will minimise (not maximise) the KL divergence.
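
To spell out the algebra I have in mind (writing $H(p, q_\theta)$ for the cross entropy and $H(p)$ for the entropy, which may not be the post's notation):

$$
D_{KL}(p \,\|\, q_\theta) = H(p, q_\theta) - H(p) = -\mathbb{E}_{x \sim p}[\log q_\theta(x)] + \mathbb{E}_{x \sim p}[\log p(x)] = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q_\theta(x)}\right]
$$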

I believe the next paragraph is still correct: the maximum likelihood estimator $\hat{\theta}$ is the parameter which maximises the average log-likelihood $\frac{1}{n}\sum_{i=1}^n \log q_\theta(x_i)$, which minimises the cross-entropy, which minimises the KL divergence.
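
As a quick sanity check on that chain of equivalences, here is a small numerical sketch (my own, not from the post), using a Bernoulli model as a stand-in for $q_\theta$: the $\theta$ that maximises the sample log-likelihood is exactly the $\theta$ that minimises the empirical cross entropy, and up to sampling noise it also minimises $D_{KL}(p \,\|\, q_\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# True distribution p: Bernoulli(0.3); model q_theta: Bernoulli(theta).
p = 0.3
x = rng.binomial(1, p, size=10_000)      # i.i.d. sample from p
thetas = np.linspace(0.01, 0.99, 99)     # candidate parameters

# Average log-likelihood (1/n) * sum_i log q_theta(x_i) for each theta.
avg_loglik = np.array([
    np.mean(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas
])

# The empirical cross entropy is just the negated average log-likelihood.
cross_entropy = -avg_loglik

# Exact KL divergence D_KL(p || q_theta) for Bernoulli distributions.
kl = p * np.log(p / thetas) + (1 - p) * np.log((1 - p) / (1 - thetas))

print(thetas[np.argmax(avg_loglik)])     # maximum likelihood estimate
print(thetas[np.argmin(cross_entropy)])  # identical to the line above
print(thetas[np.argmin(kl)])             # ~0.3, the true parameter
```

The first two printouts are always identical (minimising the empirical cross entropy is literally maximising the average log-likelihood), and the third agrees with them up to sampling noise.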

Apologies if any of what I've said above is incorrect; I'm not an expert on this.

I think there is a mistake in this equation.  and  are the wrong way round. It should be: