Comments

Javier10

We're at the wooden table with benches that seats 6-8 people.

Javier10

We accidentally created another event for this meetup. Since more people have RSVP'd on the other one, I will use it as a source of truth about who's coming and default to it for future communications. I recommend you RSVP there too if you haven't yet. Apologies for the inconvenience.

Javier10

Congrats on the excellent work! I've been following the LLM forecasting space for a while and your results are really pushing the frontier.

Some questions and comments:

  1. AI underconfidence: The AI looks underconfident at probabilities below 10% and above 90%. This is somewhat apparent from the calibration curves in figure 3b (especially) and 3c (less so), though I'm not sure about it because the figures don't have confidence intervals. However, table 3 (the AI ensemble outperforms the crowd when the crowd is uncertain, but the crowd outperforms the AI ensemble overall) and figure 4c (the AI ensemble outperforms the crowd early on by a small margin, but the crowd outperforms the AI ensemble near question close) seem to point in the same direction. My hypothesis is that this is explained (at least in part) by your use of a trimmed mean to aggregate forecasts from individual models. Have you tried extremizing instead (see the first sketch after this list)?
  2. Performance over time: My understanding is that the AI's Brier score is an unweighted average of the Brier scores of the five forecasts made at different retrieval times. However, humans on INFER and Metaculus are scored according to the integral of their Brier score over time, i.e. their score gets weighted by how long a given forecast is up (the second sketch after this list illustrates the difference). Given your retrieval schedule, wouldn't your average put comparatively less weight on the AI's final forecast? This may underestimate its performance, since the last forecast should be the best one.
    1. Relatedly, have you tried other retrieval schedules, and if so, did they affect the results?
    2. Also, if the AI's Brier score is an unweighted average across retrieval times, then I'm confused about an apparent mismatch between table 3 and figure 4c. Table 3 says the AI's average Brier score across all questions and retrieval times is .179, but figure 4c shows the AI's average Brier score across all questions at roughly below .161 at every retrieval time. So averaging the data points in figure 4c should give a number below .161, not .179. Am I missing something?
  3. Using log scores: This has already been addressed in other comments, but I'd be curious to see if humans still outperform AIs when using log scores.
  4. Estimating standard errors: You note that your standard error estimates likely underestimate the true errors because your data is a time series and thus not iid. Do you think this matters in practice, or is the underestimate likely to be small? Do you have any thoughts on how to estimate the errors more accurately (one possibility is shown in the third sketch below)?
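
To clarify what I mean by extremizing in point 1: here is a minimal sketch, with made-up per-model probabilities and an illustrative extremizing factor (not your actual aggregation code), contrasting a trimmed mean with extremizing the mean log-odds.

```python
# Sketch only: hypothetical model forecasts and parameters, not the paper's pipeline.
import numpy as np
from scipy import stats

def trimmed_mean(probs, trim=0.2):
    """Drop the most extreme forecasts on each side, then average."""
    return stats.trim_mean(probs, proportiontocut=trim)

def extremized_mean(probs, alpha=2.0):
    """Average in log-odds space, then push away from 0.5 by a factor alpha > 1."""
    log_odds = np.log(probs / (1 - probs))
    return 1 / (1 + np.exp(-alpha * log_odds.mean()))

probs = np.array([0.80, 0.85, 0.90, 0.70, 0.95])  # five hypothetical model forecasts
print(trimmed_mean(probs))     # ~0.85, stays within the range of the inputs
print(extremized_mean(probs))  # ~0.97, pushed toward the extreme
```

The intuition: if the individual models are systematically underconfident, a trimmed mean can only stay within the range of their forecasts, whereas extremizing can move the aggregate closer to 0 or 1.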
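
To make point 2 concrete, here is a second sketch, with a hypothetical retrieval schedule, forecasts, and outcome, comparing an unweighted average of Brier scores across retrieval times with a duration-weighted average of the kind used for human forecasters.

```python
# Sketch only: the retrieval schedule, forecasts, and outcome are invented for illustration.
import numpy as np

def brier(p, outcome):
    return (p - outcome) ** 2

# Five retrieval times as fractions of the question's open period; the last
# forecast stays up until question close at t = 1.0.
retrieval_times = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
forecasts = np.array([0.55, 0.60, 0.70, 0.80, 0.90])  # improving over time
outcome = 1.0

scores = brier(forecasts, outcome)
unweighted = scores.mean()

# Weight each forecast by how long it was the active forecast.
durations = np.diff(np.append(retrieval_times, 1.0))
duration_weighted = np.average(scores, weights=durations)

print(unweighted)         # ~0.10
print(duration_weighted)  # ~0.055, dominated by the later, better forecasts
```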
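
Finally, on point 4: one option I've seen for serially correlated data is a moving-block bootstrap. A rough sketch, run on simulated autocorrelated score differences purely to illustrate the mechanics:

```python
# Sketch only: the score differences below are simulated, not your data.
import numpy as np

rng = np.random.default_rng(0)

def block_bootstrap_se(x, block_len=10, n_boot=2000):
    """SE of the mean via resampling contiguous blocks (preserves local dependence)."""
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    means = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        sample = np.concatenate([x[s:s + block_len] for s in starts])[:n]
        means[b] = sample.mean()
    return means.std(ddof=1)

# Hypothetical autocorrelated differences between AI and crowd Brier scores,
# ordered by question resolution date (moving-average noise around a small gap).
noise = rng.normal(size=500)
diffs = 0.02 + np.convolve(noise, np.ones(5) / 5, mode="same") * 0.05

print(diffs.std(ddof=1) / np.sqrt(len(diffs)))  # naive iid standard error
print(block_bootstrap_se(diffs))                # typically larger when the data are autocorrelated
```

Even if the block-bootstrap errors turn out close to the iid ones on your data, that would be a useful sanity check on the reported standard errors.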