> methods based on rounding probabilities are hot flaming garbage

I think this depends a lot on what you're interested in, i.e. what scoring rules you use. Someone who runs the same analysis with Brier instead of log-scores might disagree.

More generally, I'm not convinced it makes sense to think of "precision" as a constant, let alone a universal one, since it depends on

  • the scoring rule in question: Imagine a set of forecasts that's awfully calibrated on values <1% and >99%, but perfectly calibrated on values between 1% and 99%. With the log-score, this would probably get a bad precision value, while with Brier it would get a great one (a quick numerical sketch follows below this list).
  • someone's calibration, as you point out with your final calibration plot.
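
To illustrate the first point, here is a rough numerical sketch. This is not the post's actual precision computation, just the raw scores on a made-up dataset (the 99.9%-stated vs 90%-true miscalibration and all numbers are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_score(p, o):
    # Penalty form of the log score: -log of the probability assigned to the outcome (lower is better).
    return -np.log(np.where(o == 1, p, 1 - p))

def brier(p, o):
    # Brier score: squared distance between forecast and outcome (lower is better).
    return (p - o) ** 2

n = 200_000

# Perfectly calibrated forecasts in the interior (1%..99%).
p_in = rng.uniform(0.01, 0.99, n)
o_in = rng.binomial(1, p_in)

# Awfully calibrated extreme forecasts: stated 99.9%, true frequency only 90%.
p_ex = np.full(n, 0.999)
o_ex = rng.binomial(1, 0.90, n)

for name, score in [("log  ", log_score), ("Brier", brier)]:
    print(name,
          "interior:", round(score(p_in, o_in).mean(), 3),
          " extreme:", round(score(p_ex, o_ex).mean(), 3))

# Roughly: the extreme bucket looks clearly worse than the interior under the log score
# (~0.69 vs ~0.51 nats) but actually *better* under Brier (~0.10 vs ~0.17), so a
# log-score-based analysis reacts strongly to the bad extremes while a Brier-based one doesn't.
```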

> I believe that these approaches are not good: For small datasets they produce large oscillations in the score, not smooth declines, and they improve the scores of worse-than-random forecast datasets.

I don't think it's very counterintuitive/undesirable for (what, in practice, is essentially) noise to make worse-than-random forecasts better. In fact, this also happens if you replace log-scores with Brier in the version of your analysis that uses random noise instead of rounding.
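
As a quick sanity check of that claim, here is a toy sketch (my own setup, not your exact pipeline): an anti-calibrated forecaster, perturbed by Gaussian noise in log-odds space, scored with Brier.

```python
import numpy as np

rng = np.random.default_rng(1)

def brier(p, o):
    return (p - o) ** 2

def logit(p):
    return np.log(p / (1 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n = 200_000
# Worse-than-random forecaster: says 10% for events that in fact happen 90% of the time.
p0 = np.full(n, 0.10)
outcomes = rng.binomial(1, 0.90, n)

for sigma in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]:
    noisy = sigmoid(logit(p0) + sigma * rng.standard_normal(n))
    print(f"sigma = {sigma:>3}: mean Brier = {brier(noisy, outcomes).mean():.3f}")

# The unperturbed forecasts score ~0.73 (much worse than the 0.25 of a flat 50% forecast);
# adding more noise drags the mean Brier down toward the large-noise limit of 0.5,
# i.e. the "noise" genuinely improves this worse-than-random dataset.
```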

Also, regarding oscillations: I don't think properties of "precision" obtained from small datasets are too important, for much the same reason that I usually don't pay a lot of attention to calibration plots obtained from a handful of forecasts.

> As we increase the perturbation, the score falls ~monotonically (which I conjecture to always be true in the limit of infinitely many samples)

This conjecture is true, and it should generalise easily to other 1-parameter families of centered, symmetric distributions admitting suitable couplings (e.g. additive $\mathcal{N}(0,\sigma^2)$ noise in log-odds space), using the fact that $\log(\operatorname{sigmoid}(x+y)) + \log(\operatorname{sigmoid}(x-y))$ is decreasing in $y$ for all log-odds $x$ and all positive $y$ (QED).
(NB: This fails when replacing log-scores with Brier.)
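
For the log-score case, here is a sketch of why that sum is decreasing (notation mine; I write $s$ for the sigmoid to avoid clashing with the noise scale $\sigma$). For a single forecast with log-odds $x$ whose event occurred, pairing the perturbations $+y$ and $-y$ gives the contribution

$$f(y) \;=\; \log s(x+y) + \log s(x-y), \qquad s(t) = \frac{1}{1+e^{-t}},$$

and since $(\log s)'(t) = 1 - s(t)$,

$$f'(y) \;=\; \big(1 - s(x+y)\big) - \big(1 - s(x-y)\big) \;=\; s(x-y) - s(x+y) \;<\; 0 \quad \text{for } y > 0.$$

The case where the event did not occur is the same with $x$ replaced by $-x$, and averaging over a symmetric noise distribution pairs each $+y$ with $-y$; increasing the noise scale pushes $|y|$ stochastically upward (via the couplings mentioned above), so the expected log-score falls monotonically.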

> Rounding very strongly rounds everything to 50%, so with strong enough rounding every dataset has the same score.

I could make a similar argument for the noise-based version, if I chose to use Brier (or any other scoring rule $S$ that depends only on $|p - \text{outcome}|$ and converges to finite values as $p$ tends towards 0 and 1): With sufficiently strong noise, every forecast becomes ≈0% or ≈100% with equal probability, so the expected score in the "large noise limit" converges to $(S(0, \text{outcome}) + S(1, \text{outcome}))/2$.
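
Spelling that out for Brier with additive $\mathcal{N}(0,\sigma^2)$ noise in log-odds space (notation mine, $\ell$ the forecast's log-odds and $Z$ standard normal): as $\sigma \to \infty$ we have $\mathbb{P}(\ell + \sigma Z > 0) \to \tfrac12$ while $|\ell + \sigma Z| \to \infty$, so $\operatorname{sigmoid}(\ell + \sigma Z)$ converges in distribution to a fair coin on $\{0,1\}$ and

$$\mathbb{E}\Big[\big(\operatorname{sigmoid}(\ell + \sigma Z) - \text{outcome}\big)^2\Big] \;\longrightarrow\; \frac{(0-\text{outcome})^2 + (1-\text{outcome})^2}{2} \;=\; \frac12$$

for either outcome.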

Just to confirm: Writing $p_t$, the probability of the event $A$ at time $t$, as $p_t = \mathbb{E}[\mathbf{1}_A \mid \mathcal{F}_t]$ (here $\mathcal{F}_t$ is the sigma-algebra at time $t$), we see that $p_t$ must be a martingale via the tower rule.
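
Spelling out that step: for $s \le t$, the tower rule gives

$$\mathbb{E}[p_t \mid \mathcal{F}_s] \;=\; \mathbb{E}\big[\mathbb{E}[\mathbf{1}_A \mid \mathcal{F}_t] \,\big|\, \mathcal{F}_s\big] \;=\; \mathbb{E}[\mathbf{1}_A \mid \mathcal{F}_s] \;=\; p_s.$$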

The log-odds $\ell_t = \log\frac{p_t}{1-p_t}$ are not martingales unless $\langle p\rangle_t \equiv 0$, because Itô gives us

$$d\ell_t \;=\; \frac{dp_t}{p_t(1-p_t)} \;+\; \frac{2p_t - 1}{2\,p_t^2(1-p_t)^2}\,d\langle p\rangle_t.$$

So unless $p_t$ is continuous and of bounded variation (⇒ $\langle p\rangle_t = 0$, but this also implies that $p_t$ is constant, since a continuous martingale of bounded variation is constant; the integrand of the drift part only vanishes if $p_t = 1/2$ for all $t$), the log-odds are not a martingale.
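
For reference, the coefficients in the display above come from Itô's formula applied to $f(p) = \log\frac{p}{1-p}$:

$$f'(p) = \frac{1}{p(1-p)}, \qquad f''(p) = \frac{2p-1}{p^2(1-p)^2},$$

so the drift term $\tfrac12 f''(p_t)\,d\langle p\rangle_t$ vanishes only where $p_t = \tfrac12$ or $d\langle p\rangle_t = 0$.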

Interesting analysis on log-odds might still be possible (using the appropriate discrete-time/jump-process analogues, as we naturally get when working with real data), but it's not obvious to me whether this comes with any advantages over just working with $p_t$ directly.

(Why) are you not happy with Velenik's answer or "a probabilistic theory tells us that if we look at an event $A$ and perform the same experiment $N$ times, then the fraction of experiments where $A$ happened approaches $P(A)$ in a LLN-like manner"? Is there something special about physical phenomena as opposed to observables?

 

> $[0,1]$ can be written as the union of a meager set and a set of null measure. This result forces us to make a choice as to which class of sets we will neglect, or otherwise we will end up neglecting the whole space!

Either neither of these sets is measurable, or this meagre set has measure 1. Either way, it seems obvious what to neglect.
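
For concreteness, one standard construction (my notation, taking the space in the quote to be $[0,1]$) gives such a decomposition with both pieces Borel and the meager piece of measure 1: enumerate the rationals in $[0,1]$ as $q_1, q_2, \dots$ and set

$$U_k \;=\; \bigcup_{n \ge 1} \Big(q_n - 2^{-(n+k)},\; q_n + 2^{-(n+k)}\Big) \cap [0,1], \qquad N \;=\; \bigcap_{k \ge 1} U_k.$$

Each $U_k$ is open and dense with $\lambda(U_k) \le 2^{1-k}$, so $N$ is a dense $G_\delta$ (hence comeager) with $\lambda(N) = 0$, and $M = [0,1]\setminus N$ is meager with $\lambda(M) = 1$.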