You are right about the proportion of dots within the error bars. This sounds like something I would want to change.
100% is never inside the bar because these are not classical error bars but Bayesian credible intervals for where your true probability lies, computed with a uniform prior between 0% and 100%. If I pick a coin whose probability p of Heads is drawn uniformly between 0% and 100%, then after observing 4 Heads out of 4 throws you should still believe, on average, that the probability of Heads is about 83% ( = (n_heads + 1) / (n_throws + 2), by Laplace's rule of succession ), and a 75% credible interval would not contain 100%.
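To make the coin example concrete, here is a minimal check of that arithmetic (my sketch, not the app's code). Under a uniform prior, 4 Heads in 4 throws gives a Beta(5, 1) posterior, whose CDF is p^5, so the posterior mean is (n_heads + 1) / (n_throws + 2) = 5/6 ≈ 83% and quantiles have a closed form:

```python
# Coin example: uniform prior, k heads observed in n throws.
# Posterior over the head probability is Beta(k + 1, n - k + 1) = Beta(5, 1),
# whose CDF is F(p) = p^5, so the q-quantile is simply q ** (1 / 5).
n, k = 4, 4

post_mean = (k + 1) / (n + 2)       # rule of succession: 5/6 ≈ 0.833
lower = 0.125 ** (1 / (k + 1))      # 12.5th percentile ≈ 0.66
upper = 0.875 ** (1 / (k + 1))      # 87.5th percentile ≈ 0.97

print(post_mean, lower, upper)      # the central 75% interval stops short of 1.0
```

So even a perfect 4-for-4 record leaves the interval's upper end visibly below 100%.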
So you would need to provide more evidence before I believe your 100% answers are indeed right 100% of the time. I agree this is confusing, and I want to change it for the better, but I am unsure how.
For every answer given with probability p, I count how many times it was the right answer and how many times it was wrong. If anyone has a recommendation on how to compute the top and bottom of the error bars from these two counts, I would really appreciate it.
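One standard answer to that question (a sketch of one reasonable choice, not necessarily what the app should do): keep the uniform prior from the coin example and take equal-tailed quantiles of the resulting Beta posterior. The `calibration_bar` helper below is a hypothetical name for illustration:

```python
from scipy.stats import beta


def calibration_bar(n_right, n_wrong, level=0.75):
    """Equal-tailed credible interval for the true accuracy in one
    probability bucket, under a uniform Beta(1, 1) prior.

    Returns (lower, upper): quantiles of the Beta posterior."""
    posterior = beta(n_right + 1, n_wrong + 1)
    tail = (1 - level) / 2
    return posterior.ppf(tail), posterior.ppf(1 - tail)


# Example bucket: 4 right, 0 wrong -- the upper end stays below 1.0.
lo, hi = calibration_bar(4, 0)
print(lo, hi)
```

One design choice to be aware of: equal-tailed intervals always leave probability mass in both tails, which is exactly why a 4-for-4 bucket never reaches 100%; a highest-density interval would behave differently at the edges.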
I was waiting to make the app a bit better first. I made a post out of it today:
Here you can see a graph of calibration of a user (available in the app):
And here you can see graphs of calibration for some of the quizzes of the app:
They clearly show that the participants are overconfident in their answers.
Since I read this post I have implemented this small app:
I make apps only as a hobby, so it is not bug-free, scalable, or great. Feel free to send advice, comments, or requests.
Several similar apps exist, all of which had to solve the difficulty of building a set of interesting questions. I could make a small list if you are interested.