NunoSempere

I'm an independent researcher, hobbyist forecaster, programmer, and aspiring effective altruist.

In the past, I've studied Maths and Philosophy, dropped out in exasperation at the inefficiency; picked up some development economics; helped implement the European Summer Program on Rationality during 2017, 2018 and 2019, and SPARC during 2020; worked as a contractor for various forecasting and programming projects; volunteered for various Effective Altruism organizations, and carried out many independent research projects. In a past life, I also wrote a popular Spanish literature blog, and remain keenly interested in Spanish poetry.

I like to spend my time acquiring deeper models of the world, and a good fraction of my research is available on nunosempere.github.io.

With regard to forecasting, I am LokiOdinevich on GoodJudgementOpen, and Loki on CSET-Foretell, and I have been running a Forecasting Newsletter since April 2020. I also quite enjoy winning bets against people too confident in their beliefs.

I was a Future of Humanity Institute 2020 Summer Research Fellow, and I'm working on a grant from the Long Term Future Fund to do "independent research on forecasting and optimal paths to improve the long-term." You can share feedback anonymously with me here.

Can We Place Trust in Post-AGI Forecasting Evaluations?

the failures of "quick resolution" (years)

Note that you can solve this by chaining markets together, i.e., having a market every year asking what the next market will predict, where the last market is 1y before AGI. This hasn't been tried much in reality, though.

AGI Predictions

That was fun. This time, I tried not to update too much on other people's predictions. In particular, I'm at 1% for "Will we experience an existential catastrophe before we build AGI?" and at 70% for "Will there be another AI Winter (a period commonly referred to as such) before we develop AGI?", but would probably defer to a better aggregate on the second one.

Range and Forecasting Accuracy

Another interesting thing you can do is to calculate the accuracy score (Brier score - average of the Brier scores for the question), which adjusts for question difficulty. You gesture at this in your "Accuracy between questions" section.

If you do this, forecasts made further from the resolution time do worse, both in PredictionBook and in Metaculus (the correlation is significant at p<0.001, but very small). Code in R:

```
datapre <- read.csv("pb2.csv") ## or met2.csv
data <- datapre[datapre$range>0,]
data$brier = (data$result-data$probability)^2
accuracyscores = c() ## Lower is better, much like the Brier score.
ranges = c()
for(id in unique(data$id)){
  predictions4question = (data$id == id)
  briers4question = data$brier[predictions4question]
  accuracyscores4question = briers4question - mean(briers4question)
  ranges4question = data$range[predictions4question]
  accuracyscores=c(accuracyscores,accuracyscores4question)
  ranges=c(ranges, ranges4question)
}
summary(lm(accuracyscores ~ ranges))
```

Range and Forecasting Accuracy

Anyways, if I adjust for question difficulty, results are as you would expect; accuracy is worse the further removed the forecast is from the resolution.

Range and Forecasting Accuracy

So I was trying to adjust for longer-term questions being easier by doing the following:

- For each question, calculate the average Brier score for available predictions.
- For each prediction, calculate the accuracy score as Brier score - average Brier score of the question.
- Correlate accuracy score with range.

So I was trying to do that, and I thought, well, I might as well run the correlation between accuracy score and log range. But then some of the ranges are negative, which shouldn't be the case.
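The steps above can also be written without an explicit loop. This is only a minimal sketch: the toy values are made up for illustration, the column names are assumed to match `pb2.csv`/`met2.csv`, and dropping non-positive ranges before taking the log is a workaround on my part, not an explanation of why they appear.

```r
## Toy data, made up for illustration; the real files have the same columns.
data <- data.frame(
  id          = c(1, 1, 1, 2, 2, 2),
  result      = c(1, 1, 1, 0, 0, 0),
  probability = c(0.6, 0.8, 0.9, 0.5, 0.3, 0.1),
  range       = c(30, 10, -2, 100, 50, 5)
)

## log() is undefined for non-positive ranges, so drop those rows first.
## (Assumption: such rows are data errors and can be discarded.)
data <- data[data$range > 0, ]

data$brier <- (data$result - data$probability)^2

## ave() returns, for each row, the mean Brier score of that row's question.
data$accuracy <- data$brier - ave(data$brier, data$id)

## Regress the difficulty-adjusted score on log range.
fit <- lm(accuracy ~ log(range), data = data)
summary(fit)
```

By construction the accuracy scores within each question sum to zero, so the regression only picks up variation within questions, not between them.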

Range and Forecasting Accuracy

Why do some forecasts have negative ranges?

Range and Forecasting Accuracy

Another interesting thing you can do with the data is to calculate the prior probability that a Metaculus or PB question will resolve positively:

```
data <- read.csv("met2.csv") ## or pb2.csv
data$brier = (data$result-data$probability)^2
results = c()
for(id in unique(data$id)){
  predictions = ( data$id == id )
  ## All predictions on a question share its result, so take the first one.
  result = data$result[predictions][1]
  results = c(results, result)
}
mean(results)
```

For Metaculus, this is 0.3160874, for PB this is 0.3770311
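One thing to keep in mind is that the base rate per question and the base rate per forecast can differ when some questions attract more forecasts than others. A small sketch with made-up toy data (the column names are assumed to match the csv files):

```r
## Toy data, made up for illustration: question 1 has four forecasts,
## questions 2 and 3 have one each.
data <- data.frame(
  id     = c(1, 1, 1, 1, 2, 3),
  result = c(1, 1, 1, 1, 0, 0)
)

## One result per question: take the first row of each id.
results_per_question <- tapply(data$result, data$id, function(x) x[1])

mean(results_per_question)  ## base rate across questions
mean(data$result)           ## base rate across forecasts
```

In this toy case one of three questions resolves positively, but four of six forecasts are on a positively resolving question, so the two averages are 1/3 and 2/3.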

Range and Forecasting Accuracy

Nice post! I agree that the conclusion is counterintuitive.

For Metaculus, the results are pretty astonishing: the correlation is negative for all four options, meaning that the higher the range of the question, the lower the Brier score (and therefore, the higher the accuracy)! And the correlation isn't extremely low either: -0.2 is quite formidable.

I tried to replicate some of your analysis, but I got different results for Metaculus (I still got the negative correlation for PredictionBook, though). I think this might be to an extent an artifact of the way you group your forecasts:

In bash, add headers, so that I can open the files and see how they look:

```
$ echo "id,questionrange,result,probability,range" > met2.csv
$ cat met.csv >> met2.csv
$ echo "id,questionrange,result,probability,range" > pb2.csv
$ cat pb.csv >> pb2.csv
```

In R:

```
library(ggplot2)
## Metaculus
data <- read.csv("met2.csv")
data$brier = (data$result-data$probability)^2
summary(lm(data$brier ~ data$range)) ## Positive correlation.
ggplot(data=data, aes(x=range, y=brier))+
geom_point(size=0.1)
### Normalize the range and the brier to get better units
data$briernorm = (data$brier - mean(data$brier))/sd(data$brier)
data$rangenorm = (data$range - mean(data$range))/sd(data$range)
summary(lm(data$briernorm ~ data$rangenorm))
### I get a correlation of ~0.02, on a standard deviation of 1, i.e., a correlation of 2%.
## Same thing for PredictionBook
data <- read.csv("pb2.csv")
data$brier = (data$result-data$probability)^2
summary(lm(data$brier ~ data$range)) ## Negative correlation.
ggplot(data=data, aes(x=range, y=brier))+
geom_point(size=0.2)
### Normalize the range and the brier to get better units
data$briernorm = (data$brier - mean(data$brier))/sd(data$brier)
data$rangenorm = (data$range - mean(data$range))/sd(data$range)
summary(lm(data$briernorm ~ data$rangenorm))
### I get a correlation of ~-0.02, on a standard deviation of 1, i.e., a correlation of -2%.
```

Essentially, when you say

To compare the accuracy between forecasts, one can't deal with individual forecasts, only with sets of forecasts and outcomes. Here, I organise the predictions into buckets according to range.

This doesn't necessarily follow, i.e., you can still calculate a regression between Brier score and range (time until resolution).

Range and Forecasting Accuracy

Nitpicks:

- Some typos: ones => one's; closed questions (questions that haven't yet been resolved, but that can still be predicted on) => closed questions (questions that haven't yet been resolved, but that can't be predicted on); PredictionPook => PredictionBook
- You don't clearly say when you start using Klong. Klong also sounds like it might be really fun to learn, but it's maybe a little suboptimal for replication purposes, because it isn't as well-known.

Yes, I can imagine cases where this setup wouldn't be enough.

Though note that you could still buy the shares in the last year. Also, if the market corrects by 10 percentage points each year (i.e., the value of a share of yes increases from 10% to 20% to 30% to 40%, etc.), it might still be worth it (note that the market would resolve each year to the value of a share, not to 0 or 100).
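To make the arithmetic concrete, here is a sketch of the yearly returns in that hypothetical. It assumes the share price climbs by 10 percentage points each year and that each yearly market resolves to the then-current share value; the price path is illustrative, not data.

```r
## Hypothetical price path of a "yes" share, in probability units.
prices <- seq(0.1, 0.9, by = 0.1)

## Return from buying at the start of a year and having that year's market
## resolve to the next year's share value.
buy  <- head(prices, -1)
sell <- tail(prices, -1)
yearly_return <- sell / buy - 1

data.frame(buy = buy, sell = sell, yearly_return = round(yearly_return, 3))
```

The return shrinks every year (100%, then 50%, then 33%, and so on), so whether it is worth it depends on what else you could do with the money in the meantime.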

Also note that the current way in which prediction markets are structured is, as you point out, dumb: you bet 5 depreciating dollars which then go into escrow, rather than $5 worth of, say, S&P 500 shares, which increase in value. But this could change.