What is the best mathematical, intuitive explanation of why entropy maximizes in a uniform distribution? I'm looking for a short proof using the most elementary mathematics possible.
Please no explanation like "because entropy was designed in this way", etc...
To understand entropy you need to understand expected values, and self-information.
Expected value is what happens on average -- a sum of outcomes weighted by how likely they are. The expected value of a 6 sided die is 1/6 + 2/6 + 3/6 + 4/6 + 5/6 + 6/6 = 3.5.
Self-information is sort of like how many bits of information you learn if something random becomes certain. For example, a fair coin comes up heads with 0.5 probability and tails with 0.5 probability. So if you learn this fair coin actually came up tails, you will learn the number of bits equal to its self-information, which is -log_2(0.5) = log_2(1/0.5) = 1.
Why people decided that this specific measure is what we should use is a good question, and not so obvious. Information theorists try to justify it axiomatically. That is, if we pick this measure, it obeys some nice properties we intuitively want it to obey. Properties like "we want this number of be higher for unlikely events" and "we want this number to be additive for independent events." This is why we get a minus sign and a log (base, as long as larger than 1, does not matter for logs, but people like base 2).
Entropy is just the expected self-information, so once you understand the above two, you understand entropy.
Once you understand entropy, the reason entropy is maximized for the uniform distribution is related to why area of a figure with a given circumference is maximized by a circle. Area is also a sum (of little tiny pie slices of a figure), just like entropy is a sum. For area, the constraint is circumference being a given number, for entropy the constraint is probabilities must sum to one. You can think of both of these constraints as "normalization constraints" -- things have to sum to some number.
In both cases, the sum is maximized if individual pieces are as equal to each other as allowed.
Does the information theory definition of entropy actually correspond to the physics definition of entropy? I understand what entropy means in terms of physics, but the information theory definition of the terms seemed fundamentally different to me. Is it, or does one actually correspond to the other in some way that I'm not seeing?
Shannon's definition of entropy corresponds very closely to the definition of entropy used in statistical mechanics. It's slightly more general and devoid of "physics baggage" (macro states and so on).
Analogy: Ising model of spin glasses vs undirected graphical models (Markov random fields). The former has a lot of baggage like "magnetization, external field, energy." The latter is just a statistical model of conditional independence on a graph. The Ising model is a special case (in fact the first developed case, back in 1910) of a Markov random field.
Physicists have a really good nose for models.
A nonuniform distribution has two arguments with different probabilities. Changing both probabilities to their average increases the entropy.
Well, it really is defined that way. Before doing math, it's important to understand that entropy is a way of quantifying our ignorance about something, so it makes sense that you're most ignorant when (for discrete options) you can't pick out one option as more probable than another.
Okay, on to using the definition of entropy as the sum over event-space of -P log(P) of all the events. E.g. if you only had one possible event, with probability 1, your entropy would be 1 log(1) = 0. Suppose you had two events with different probabilities. If you changed the probability assignment so their probability gets closer together, entropy goes up. This is because the function -P log(P) is concave downwards between 0 and 1 - this means that the entropy is always higher between two points than you'd get by just averaging those points (or taking any weighted average, represented by a straight line connecting the two points).. So if you want to maximize entropy, you move all the points together as far as they can go.
Imagine you have two unknown bits, generated uniformly at random. What's the fastest way to learn them by asking yes or no questions?
1) Ask whether the first bit is 0, then do the same for the second bit. This way you always spend two questions.
2) First ask whether the bits are 00. If yes, you win. Otherwise ask the remaining questions in any way you like. This works out to 9/4 questions on average, which is worse than the first method.
Moral of the story: if you want to learn information as fast as possible, you must split the search space in equal parts. That's the same as saying that a uniform distribution has maximum self-information, i.e. maximum entropy.
In the simplest example, when you have a closed system where part of the system starts out warmer and the other part starts out cooler, it's fairly intuitive to understand why entropy will usually increase over time until it reaches the maximum level. When two molecule molecule of gas collide, a high-energy (hot) molecule and a lower-energy (cooler) molecule, the most likely result is that some energy will be transferred from the warmer molecule to the cooler molecule. Over time, this process will result in the temperature equalizing.
The math behind this process and how it related to entropy isn't that complicated, look up the Clausius theorem.