Comments

By the law of large numbers, $\frac{1}{n}\sum_{i=1}^n \log q_\theta(x_i) \to \mathbb{E}_{x \sim p}[\log q_\theta(x)]$ almost surely. This is the cross entropy of $p$ and $q_\theta$. Also note that if we subtract this from the entropy of $p$, we get $D_{KL}(p \,\|\, q_\theta)$. So minimising the cross entropy over $\theta$ is equivalent to maximising $D_{KL}(p \,\|\, q_\theta)$.

I think the cross entropy of $p$ and $q_\theta$ is actually $-\mathbb{E}_{x \sim p}[\log q_\theta(x)]$ (note the negative sign). The entropy of $p$ is $-\mathbb{E}_{x \sim p}[\log p(x)]$. The KL divergence is then the cross entropy minus the entropy, not the other way around. So minimising the cross entropy over $\theta$ will minimise (not maximise) the KL divergence.
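
To spell out the algebra I have in mind (writing $H(p, q_\theta)$ for the cross entropy and $H(p)$ for the entropy, which may not be the post's notation):

$$
D_{KL}(p \,\|\, q_\theta) = H(p, q_\theta) - H(p) = -\mathbb{E}_{x \sim p}[\log q_\theta(x)] + \mathbb{E}_{x \sim p}[\log p(x)] = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q_\theta(x)}\right]
$$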

I believe the next paragraph is still correct: the maximum likelihood estimator $\hat{\theta}$ is the parameter which maximises the average log-likelihood $\frac{1}{n}\sum_{i=1}^n \log q_\theta(x_i)$, which minimises the cross-entropy, which minimises the KL divergence.
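
As a quick sanity check on that chain of equivalences, here is a small numerical sketch (my own, not from the post), using a Bernoulli model as a stand-in for $q_\theta$: the $\theta$ that maximises the sample log-likelihood is exactly the $\theta$ that minimises the empirical cross entropy, and up to sampling noise it also minimises $D_{KL}(p \,\|\, q_\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# True distribution p: Bernoulli(0.3); model q_theta: Bernoulli(theta).
p = 0.3
x = rng.binomial(1, p, size=10_000)      # i.i.d. sample from p
thetas = np.linspace(0.01, 0.99, 99)     # candidate parameters

# Average log-likelihood (1/n) * sum_i log q_theta(x_i) for each theta.
avg_loglik = np.array([
    np.mean(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas
])

# The empirical cross entropy is just the negated average log-likelihood.
cross_entropy = -avg_loglik

# Exact KL divergence D_KL(p || q_theta) for Bernoulli distributions.
kl = p * np.log(p / thetas) + (1 - p) * np.log((1 - p) / (1 - thetas))

print(thetas[np.argmax(avg_loglik)])     # maximum likelihood estimate
print(thetas[np.argmin(cross_entropy)])  # identical to the line above
print(thetas[np.argmin(kl)])             # ~0.3, the true parameter
```

The first two printouts are always identical (minimising the empirical cross entropy is literally maximising the average log-likelihood), and the third agrees with them up to sampling noise.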

Apologies if any of what I've said above is incorrect; I'm not an expert on this.

I think there is a mistake in this equation.  and  are the wrong way round. It should be: