There's a whole subfield on "scoring rules", which try to measure people's calibration and resolution more exactly.
There are scoring rules that incorporate priors, scoring rules that incorporate information value to the question asker, and scoring rules that incorporate sensitivity to distance (if you're close to the answer, you get more points). There's a class of "strictly proper" scoring rules that incentivize people to give their true probability. I did a deep dive into scoring rules when writing the Verity whitepaper. Here are some of the more interesting/useful research articles on scoring rules:
Order-Sensitivity and Equivariance of Scoring Functions - PDF - arxiv.org: https://www.evernote.com/l/AAhfW6RTrudA9oTFtd-vY7lRj0QlGTNp4bI/
Tailored Scoring Rules for Probabilities: https://www.evernote.com/l/AAhVczys0ddF3qbfGk_s4KLweJm0kUloG7k/
Scoring Rules, Generalized Entropy, and Utility Maximization: https://www.evernote.com/l/AAh2qdmMLUxA97YjWXhwQLnm0Ro72RuJvcc/
The Wisdom of Competitive Crowds: https://www.evernote.com/l/AAhPz9MMSOJMcK5wrr8mQGNQtSOvEeKbdzc/
A formula for incorporating weights into scoring rules: https://www.evernote.com/l/AAgWghOuiUtIe76PQsXwFSPKxGv-VkzH7l8/
Sensitivity to Distance and Baseline Distributions in Forecast Evaluation: https://www.evernote.com/l/AAg7aZg9BjRDLYQ2vpGow-qqN9Q5XY-hvqE/
One thing you might look at is the Brier Score, particularly the 3-component decomposition.
Score = Reliability - Resolution + Uncertainty
The nice thing about this decomposition is that it gives you more information than a single score. The uncertainty is a sort of 'difficulty' score: it doesn't take predictions into account and is minimized when the same outcome occurs each time.
The resolution tells you how much information each prediction gives. For an event that occurs half of the time, you could predict 0.5 probability for everything, but if you knew more about what was going on then maybe you could predict a 1 or a 0. This is a much stronger statement, so the resolution gives you credit for that.
Reliability is then much like the scoring metric you describe. It is minimized (which is good, since it's a loss score) when all of the events you predict with 0.2 occur 20% of the time; that is, when your predicted probabilities match the observed frequencies.
All of this happens at arbitrary precision: it's just operations on real vectors, so the only limit is your floating-point size.
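For concreteness, here is a minimal Python sketch of the three-component decomposition. It assumes binary outcomes (0 or 1) and groups predictions by their exact forecast value rather than by bins, so the identity Score = Reliability - Resolution + Uncertainty holds up to floating-point rounding. The function name `brier_decomposition` is just an illustrative choice, not anything from a library.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Return (brier, reliability, resolution, uncertainty).

    forecasts: probabilities in [0, 1]; outcomes: 0 or 1 for each forecast.
    Predictions are grouped by exact forecast value (an assumption of this
    sketch), so no binning is involved.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Collect the outcomes that followed each distinct forecast value.
    groups = defaultdict(list)
    for p, b in zip(forecasts, outcomes):
        groups[p].append(b)

    reliability = 0.0
    resolution = 0.0
    for p, bs in groups.items():
        freq = sum(bs) / len(bs)  # observed frequency for this forecast value
        reliability += len(bs) * (p - freq) ** 2
        resolution += len(bs) * (freq - base_rate) ** 2
    reliability /= n
    resolution /= n

    # Depends only on the outcomes, not on the predictions.
    uncertainty = base_rate * (1 - base_rate)

    # Plain Brier score, computed directly for comparison.
    brier = sum((p - b) ** 2 for p, b in zip(forecasts, outcomes)) / n
    return brier, reliability, resolution, uncertainty


# Example: ten predictions that happen to be perfectly calibrated.
forecasts = [0.2] * 5 + [0.8] * 5
outcomes = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
brier, rel, res, unc = brier_decomposition(forecasts, outcomes)
```

In this example the reliability term is zero (the 0.2 predictions came true 20% of the time and the 0.8 predictions 80% of the time), the uncertainty is 0.25, the resolution is about 0.09, and the Brier score is about 0.16.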
Yes, you should draw calibration curves without binning.
The calibration curve lives in a plane where the x-axis is the probability of the prediction and the y-axis is the actual proportion of outcomes. We can make each prediction live in this plane. A scored prediction is a pair of an outcome, b, either 0 or 1, and the earlier prediction p. The value p belongs on the x-axis. The value b is sort of like a value on the y-axis. Thus (p,b) makes sense as a point in the x-y plane. It is worth plotting the scatterplot of the predictions in this representation. The calibration curve is a curve attempting to approximate this scatterplot. A technique for turning a scatterplot into the graph of a function is called a smoother. Every smoother yields a different notion of calibration curve. The most popular general-purpose smoother is loess, and it is also the most popular smoother for the specific task of calibration curves without bins. Frank Harrell (2-28) suggests tweaking the algorithm, setting α=1.