It's an interesting question, but I think the author is not arguing to change the actual divisions in every case, just to skip contested labels and go directly to the definitions that underlie them when needed. [1]
Next time you get lured into a sticker debate, stop. Use a label only if its admission criteria are crystal clear and uncontested. Otherwise, assume the label is moving contraband premises. Seize the cargo and answer the real question hidden underneath.
Saying "weight measured without balloons" skips having to define "weight" if there isn't agreement. In sports, skip making everyone agree on the definition of "woman".
However, I agree that the question of when to change the division and when to keep it persists even after agreeing on definitions. The man can continue to argue the competition should use weight with balloons. People can argue a competition among only cis-women is unfairly discriminatory against trans-women. But it's at least a bit simpler and clearer than trying to argue the same thing plus the definition of words.
Similar to Taboo Your Words
Really interesting post, I learned a lot. I always assumed D-K came from a comparison of top experts and the general population on a super long test.
Short description of the studies in the paper
For those who haven't seen it[1]: In the first 3 studies the subjects were 65, 45 and 84 undergraduate students from Cornell answering 30 questions on evaluating humor, 20 questions on logic or 20 questions on grammar, respectively. They were also asked to estimate their performance relative to their peers, with the results already mentioned. The third study had a phase two, where bottom and top quartile students came back to grade a sample of 5 tests from their peers (with the same mean and SD as the whole group, which they were informed of) and were then asked to give a new estimate of their own performance.
I agree with your demonstration for the first studies. They were small questionnaires and the results could be explained by chance (I'm not going to calculate the odds). After taking the third test (20 questions on grammar), the bottom quartile predicted they would get 12.9 questions correct and got 9.2. The top quartile predicted 16.9 and got 16.4. This increases the likelihood that it's more than chance, but it still could be that the bottom quartile just realized they didn't know the answers, and that in another sample they would get a better result. Maybe the top quartile included guesses in their estimate, and in another sample they would trade places with people in the third quartile who predicted a similar result but had worse guesses.
However, I believe the results of the second phase of study 3 can be extrapolated even if the first phase result is due to chance, and it has more important consequences. When grading the tests of 5 representative peers, the bottom quartile did a worse job, consistent with having a worse answer key (their own answers). And, not being able to correctly evaluate their peers, they increased the estimate of their own score percentile from p60.5 to p65.4, when it was actually p10.1, even after seeing other people's answers. The top quartile did a better evaluation of their peers' answers and increased their self-evaluation from p69.5 to p79.7 (actual was p88.7).
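To make the size of the miscalibration explicit, here is a small sketch that only does arithmetic on the percentiles quoted above. The `gap` helper and the dictionary layout are my own illustrative names, not anything from the paper.

```python
# Percentile self-estimates from phase two of study 3, as quoted above.
quartiles = {
    "bottom": {"before": 60.5, "after": 65.4, "actual": 10.1},
    "top": {"before": 69.5, "after": 79.7, "actual": 88.7},
}

def gap(estimate, actual):
    """Signed miscalibration in percentile points; positive means overestimation."""
    return round(estimate - actual, 1)

# For each quartile: (gap before grading peers, gap after grading peers).
gaps = {
    name: (gap(q["before"], q["actual"]), gap(q["after"], q["actual"]))
    for name, q in quartiles.items()
}

for name, (before, after) in gaps.items():
    print(f"{name} quartile: gap before = {before:+.1f}, gap after = {after:+.1f}")
```

On these numbers the bottom quartile's overestimate grows by about 5 percentile points after seeing their peers' tests, while the top quartile's underestimate shrinks by about 10.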
So, I wouldn't say the limitations of the test render it useless. It appears to overestimate the error in assessment of absolute performance, or of performance in the topic in general, as you showed. But it independently demonstrated how people with wrong conceptions, using only their own flawed answers, overestimate their relative performance, even after seeing the performance of others.
An abstraction of the experiment is to give people random answers to 20 questions, with different confidence values assigned to each answer. Then give them another 5 random test answer sheets (without confidence values in their answers) and ask them to rank the tests. Without knowing a priori which tests are the best performing, I assume it's not possible to rank them correctly starting from a bad first test, and that the results will be similar to the paper's, given a similar distribution of answers. If true, this is a natural consequence of that starting state, bound to always happen.
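A rough simulation of a simplified version of this abstraction, in case someone wants to play with it. It is a sketch under my own assumptions, not the paper's procedure: questions are 4-option multiple choice, confidence values are dropped, each grader marks a peer's answer as correct only when it matches the grader's own answer, and the skill levels (0.45, 0.65, 0.85) are made-up parameters.

```python
import random

N_QUESTIONS = 20
N_OPTIONS = 4   # assumed multiple-choice format; option 0 is always the right answer
N_PEERS = 5

def take_test(skill, rng):
    """Answer each question correctly with probability `skill`, else pick a wrong option."""
    return [0 if rng.random() < skill else rng.randrange(1, N_OPTIONS)
            for _ in range(N_QUESTIONS)]

def true_score(answers):
    """Number of answers that match the real key (option 0)."""
    return sum(a == 0 for a in answers)

def perceived_score(grader_answers, peer_answers):
    """Grade a peer using the grader's own answers as the answer key."""
    return sum(g == p for g, p in zip(grader_answers, peer_answers))

def mean_grading_error(grader_skill, peer_skill=0.65, trials=2000, seed=0):
    """Average absolute gap between the score a grader assigns and the true score."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        grader = take_test(grader_skill, rng)
        for _ in range(N_PEERS):
            peer = take_test(peer_skill, rng)
            total += abs(perceived_score(grader, peer) - true_score(peer))
    return total / (trials * N_PEERS)

low = mean_grading_error(0.45)    # hypothetical weak grader
high = mean_grading_error(0.85)   # hypothetical strong grader
print(f"weak grader error: {low:.2f}, strong grader error: {high:.2f}")
```

In this toy model a grading mistake can only happen on questions the grader got wrong, so the weaker grader misjudges peers by more on average. That matches the intuition above, but it says nothing about whether real subjects grade this way.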
Mandatory comment when talking about overconfidence: This seems true to me. Please show my mistakes so I can improve.
"Unskilled and Unaware of it" (Kruger and Dunning, 1999)