[Explanation of the math confusion] To solve the problem, think locally. At each possible number you can move up or down, and see whether this increases or decreases the total cost. For example, with the data points 1, 2 and 3: if you're at 3 (which has cost 2 + 1 + 0 = 3), moving up to 4 gives a cost of 6, and moving down to 2 gives a cost of 2.
One way to think of the cost of 4 relative to 3 is that when you move up to 4, you move one step away from each of the three data points, so you increase the cost by three. Moving from 3 to 2 means moving toward two data points and away from one, which is a net benefit.
Overall, the goal is to find a state where movement in either direction moves you away from at least as many points as you move toward. That will always be the central data point, which has equally many data points on either side: as it moves toward the points on one side, it moves away from the points on the other, plus the data point it was standing on. And yes, if you have an even number of data points, then every point between the two central data points is a median.
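This local argument is easy to check by brute force. Here's a minimal Python sketch (the candidate range and helper name are my own illustrative choices, not anything from the original discussion):

```python
def total_distance(m, data):
    """Sum of absolute distances from candidate m to every data point."""
    return sum(abs(x - m) for x in data)

# Scan integer candidates for the two datasets from the post.
for data in ([1, 2, 3], [1, 2, 4]):
    costs = {m: total_distance(m, data) for m in range(6)}
    best = min(costs, key=costs.get)
    print(data, "->", costs, "minimised at", best)
```

For both datasets the cost bottoms out at 2, matching the claim that stepping off the central data point always moves you away from more points than you move toward.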
Thanks to Buck Shlegeris for teaching me this.
It's possibly worth fitting this into a broader framework. The median minimizes the sum of |x − m|. (So it's the max-likelihood estimator if your tails are like exp(−|x|).) The mean minimizes the sum of (x − m)^2. (So it's the max-likelihood estimator if your tails are like exp(−x^2), a normal distribution.)
What about other exponents? 1 and 2 are kinda standard cases; what about 0 or infinity? Or negative numbers? Let's consider 0 first of all. |x − m|^0 (taking the limiting value 0^0 = 0) is zero if x = m and 1 otherwise. Minimizing the sum of these means maximizing the number of things equal to m. This is the mode! (We'll continue to get the mode if we use negative exponents. In that case we'd better maximize the sum instead of minimizing it, of course.) As p increases without limit, minimizing the sum of |x − m|^p gets closer and closer to minimizing max |x − m|. If the 0-mean is the mode, the 1-mean is the median and the 2-mean is the ordinary mean, then the infinity-mean is midway between the max and min of your data. This one doesn't get quite so much attention in stats class :-).
The median is famously more robust than the mean: it's affected less by outliers. (This goes along with the fatter tails it assumes: if you assume a very thin-tailed distribution, then an outlying point is super-unlikely and you're going to have to try very hard to make it less outlying.) The mode is more robust still, in that sense. The "infinity-mean" (note: these are my names, and so far as I know no one else uses them) is kinda the least robust average you can imagine, being affected *only* by the most outlying data points.
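A quick numeric illustration of those robustness claims (the dataset and the midrange helper are mine; "midrange" here is the "infinity-mean" above):

```python
from statistics import mean, median

def midrange(xs):
    """The 'infinity-mean': midpoint of min and max, set only by the extremes."""
    return (min(xs) + max(xs)) / 2

clean = [1, 2, 2, 3, 4]
spiked = [1, 2, 2, 3, 400]  # same data with the top point dragged far out

for xs in (clean, spiked):
    print(xs, "median:", median(xs), "mean:", mean(xs), "midrange:", midrange(xs))
```

The median sits still at 2, the mean jumps from 2.4 to 81.6, and the midrange jumps from 2.5 to 200.5, since it is affected only by the most outlying point.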
Yeah, thanks for this comment, I sorta skipped it because I didn't want to write too much... or something. In retrospect I'm not sure I modelled curious readers well enough, I should've just left it in.
One thing I noticed that I'm not so sure about: A motivation you might have for p = 2 over p = 1 (i.e. mean over median) is that you want a summary statistic that always changes when the data points do. As you move from (1, 2, 3) to (1, 2, 4), the median doesn't change but the mean does.
And yet, given that the p-mean approaches the midpoint of the max and min as p rises, it's curious to see that we've chosen p = 2. We wanted a summary statistic that changed as the data did, but of all the exponents greater than 1, p = 2 changes the least with the data. We could've settled on any integer greater than 1, and we picked 2.
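The sensitivity claim is easy to see directly (a small check using Python's statistics module):

```python
from statistics import mean, median

a, b = [1, 2, 3], [1, 2, 4]
print(median(a), median(b))  # the median stays at 2
print(mean(a), mean(b))      # the mean moves from 2 to 7/3
```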
From a purely mathematical point of view I don't see why the exponent should be an integer. But p=2 is preferred over all other real values because of the Central Limit Theorem.
A longer explanation with pictures can be found here - Mean, median, mode, a unifying perspective
I don't think maximizing the sum of the negative exponents gets you the mode. If you use a p < 0 and keep m away from the data points, then the supremum (infinity) is approached as m nears any data point but never attained, while if you allow m to equal a data point, the term |x − m|^p is infinite there, so every data point attains the supremum, not just the mode.
I.
Recently, an excited friend was telling me the story behind why we care about the mean, median and mode.
They explained that a straightforward idea for what you might want in an ‘average’ number is something that minimises how far it is from all the other numbers in the dataset - so if your numbers are 1, 2 and 3, you want a number x such that the sum of the distances to each datapoint is as small as possible. It turns out this number is 2.
However, if your numbers are 1, 2, and 4, the number that minimises the distance from all of them is also 2.
Huh?
When my friend told me this, the two other people I was with sort of said “Okay”. I said “What? No! I don’t believe you! It has to change when the data does - it’s a linear sum, so it has to change! It’s like you’re saying the sum of 1, 2 and 3 is the same as the sum of 1, 2 and 4. This is just wrong.” Suffice it to say, my friend’s claim wasn’t predicted by my understanding of math.
Now, did I really not believe my friend? The other two people with us were certainly fine with it. Isn’t this just bayesianism, as the old joke goes?
Actually, no. You taught me a detail to memorise, but my models didn’t improve. I won’t be able to improve how I use averages, because I don’t understand how it fits in with everything else I understand - it doesn’t fit with the models I use everywhere else in math.
I mean, I could’ve nodded along. It’s only one fact, after all. But if I’m going to remember it in the long term, it should connect to my other models and be reinforced. The alternative is to be stored in the brain with all those other memorised facts that students learn for exams and forget immediately after.
If you’re trying to build new models of a domain, it’s important to choose to speak from the confusion, not from the rest of yourself. Don’t have conversations about whether you believe a thing. Instead talk about whether you understand it.
(The problem above was the definition of the median, and an explanation of the math for the curious can be found in this comment.)
II.
It can be really hard to feel your models. Qiaochu Yuan’s method of learning involves ramping feeling-his-models up to 11. I recall him telling me about trying to learn what fire was once, where his first step was to just really feel his confusion.
After feeling the confusion, Qiaochu holds onto his frustration (which he finds easier to hold), and tries throwing ideas and possible explanations at it until all the parts finally fit together - that feeling when you say “Ohhhhhhh” and the models finally compute, and your beliefs predict the experience you have. Be frustrated with reality.
Tim Urban (of WaitButWhy) tells a similar story, where he can only write essays about things he doesn’t currently understand - and as he digs through all the facts and pieces things together, he writes down the things that made sense to him, the things that would successfully get the models across to an earlier version of Tim Urban.
I used to think this made no sense and he must just be bad at introspecting - shouldn’t you have to build an excellent model of other people to write so compellingly for so many tens of thousands of them?
Yet it’s actually really rare for authors to be strongly connected to their own models - when a teacher explains something for the hundredth time, they likely can't remember what it was like to learn it for the first. And so Tim’s explanations can be clearer than most.
In the opening example where I was surprised by the definition of the median, if you had offered me a bet I would’ve bet on the side that this was the definition of a median. But it was not a useful thought for me in that moment, to set aside my confusion and say “On reflection I believe you”. It can be correct in conversation, when your goal is understanding, to hold onto the confusion, the frustration, and let your models do the speaking.
III.
I often feel people try to move a conversation toward whether I believe the claim, rather than discussing and sharing what we each understand.
A phrase I often use: “You may have changed my betting odds, but you haven’t changed my models!”
We’re all in the game of trying to build models. Whether you’re trying to understand the field of science you’re attempting to add knowledge to, the product your startup is building, or the architecture of the AGI you’re trying to align, you need good models to leverage reality for whatever you care about.
One of the most important skills in life is the ability to hold onto your confusion and let your models do the talking, so they can interface with reality more directly. Choosing to notice and hold on to your confusion is hard, and it’s so easy to lose sight of it.
To put it another way: when your goal is understanding, noises of confusion are perfectly acceptable noises to make.
I expect that some but not all of this post is surprisingly Ben-specific. My thanks to Alex Zhu (zhukeepa) and Jacob Lagerros (jacobjacob) for reading drafts.