We built OpenForecaster, an 8B model trained to make predictions on open-ended forecasting questions. It is competitive with much larger proprietary models in held-out testing. We train it with RL on our OpenForesight dataset of 52K forecasting questions, created from global news with an automated recipe. This improves forecasting accuracy, calibration, and consistency of long-term predictions. We open-source all artifacts (including code), and describe our approach below.
Main Results
OpenForesight Test Set, May-August 2025 (302 Qs): OpenForecaster 8B (post-trained from Qwen3-8B) is competitive with 100B+ models on both Brier score and accuracy.
FutureX July-August 2025 non-numeric (86 Qs): OpenForecaster has a much higher accuracy than 100B+ models. We limit to models released before April 2025 for a fair, equal knowledge cutoff comparison.
Our model also shows better consistency on long-term predictions, and we see calibration from forecasting training generalizing to OOD (science/math/factuality) benchmarks.
Consistency in long-term (2025-2028) predictions: our training reduces logical violations in predictions, making forecasts more reliable.
Beyond forecasting: Calibration from forecasting training transfers to SimpleQA, MMLU-Pro, GPQA, which can be used to mitigate hallucinations.
How to scale forecasting data?
Training data is the primary bottleneck for training AI forecasters. Making the model predict events that are truly in the future would be too slow a feedback loop: we’d have to wait for (at least) weeks before we get useful signal. Fortunately, LLMs know about the world only up to the date of their most recent training data, i.e. their “training cutoff”. All events afterwards are effectively “in the future” for the model. We can exploit this to create forecasting questions at scale, treating post-cutoff events as the “future” that models must predict.
There are new interesting events happening around the world every day. Global news provides one way to capture them. In this work, we convert events reported in the news into open-ended forecasting questions. What do we mean by “open-ended”?
1) The questions can be expressed in natural language, opening up the space of possible questions that can be forecasted.
2) The outcome space is not a pre-defined set of options, unlike binary or multiple choice questions. The model has to come up with the possibilities on its own, surfacing the unknown unknowns.
Example forecasting questions in our dataset:
“Who will be confirmed as the new prime minister of Ukraine on 17 July 2025?”
“Who will be named as Procter & Gamble’s Chief Executive Officer by July 31, 2025?”
We will describe the automated question creation process later, but before that it is important to define how the forecasting model’s responses are scored.
Why we don't depend on prediction markets for data, unlike prior work
Prior work on judgemental forecasting evaluations has predominantly depended on prediction markets to source forecasting questions. Prediction markets like Polymarket and Kalshi are platforms where people make probabilistic predictions on questions like “Will Zohran Mamdani win the New York elections in 2025?”. However, there are a few drawbacks to relying on prediction markets for questions:
Volume: questions are written by humans, so it is hard to scale up the amount of training data.
Format: many questions are binary, i.e. the answer is either ‘Yes’ or ‘No’. This leads to noisy rewards in training, where even wrong reasoning can lead to a correct guess and get reinforced.
Topic skew: platforms overrepresent certain areas, such as US Sports, Politics and Cryptocurrency. For example, 81% of questions in ProphetArena, a popular forecasting benchmark that sources questions from Kalshi, are about sports. If we train on these questions, the model is likely to learn sports-specific heuristics, and not general forecasting.
Scoring
Scoring rules are at the heart of forecasting. However, it is not obvious how to properly grade open-ended forecasts. For each question, we ask the model for:
a prediction (a short string), and
a probability that its answer is correct.
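For example, a graded sample might carry an output like the following (the exact response schema here is an illustrative assumption, not necessarily the prompt format used in this work):

```python
# Illustrative parsed model output for one forecasting question (schema is an assumption)
forecast = {
    "prediction": "Yulia Svyrydenko",  # short free-form answer string
    "probability": 0.7,                # the model's confidence that its prediction is correct
}
```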
We can obviously calculate the accuracy of the predictions. But in forecasting, the reported probabilities are particularly important, as there is almost always some chance of being wrong. We expect reliable forecasts to be calibrated: the forecast probabilities match long-run observed frequencies.
We want a scoring rule that promotes both accuracy and calibration:
| Probability | Correctness | Score |
| --- | --- | --- |
| 0.9 | no | big penalty |
| 0.1 | no | small penalty |
| 0.3 | yes | modest reward |
| 0.9 | yes | near-max reward |
We adapt the multiclass Brier score for open-ended responses. The Brier score provides a single number that incorporates both the accuracy of predictions and the reliability of the reported probabilities, and is commonly used in judgemental forecasting.
Properties:
If you output probability 0, you always get a score of 0 (a neutral baseline), no matter what you predict. If you output probability 1, then you get +1 when correct and -1 when wrong.
Higher Brier score is better. It reflects both more accurate predictions and appropriate uncertainty.
Crucially, the Brier score is a proper scoring rule incentivizing truthfulness, i.e. it is optimal for the forecaster to report its true confidence.
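As a concrete illustration, here is a minimal sketch of a score with exactly these properties. We assume the shifted-quadratic form score = 2·p·c − p² (with c = 1 if the prediction is correct, else 0), which is one way to adapt the Brier score; the exact formula in the paper may differ.

```python
def open_ended_brier(prob: float, correct: bool) -> float:
    """Shifted quadratic (Brier-style) score for an open-ended forecast (assumed form).

    prob    -- probability the model assigns to its own predicted answer
    correct -- whether the answer-matching judge marked the prediction correct

    Properties (matching the table above):
      prob = 0 -> score 0 regardless of correctness
      prob = 1 -> +1 if correct, -1 if wrong
      overconfident misses are penalized more than cautious ones
    """
    c = 1.0 if correct else 0.0
    return 2.0 * prob * c - prob ** 2


# Reproduces the qualitative pattern in the table:
assert round(open_ended_brier(0.9, False), 2) == -0.81  # big penalty
assert round(open_ended_brier(0.1, False), 2) == -0.01  # small penalty
assert round(open_ended_brier(0.3, True), 2) == 0.51    # modest reward
assert round(open_ended_brier(0.9, True), 2) == 0.99    # near-max reward
```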
How do we grade prediction correctness?
With open-ended answers, we can’t just do exact string match. “Geoffrey Hinton” vs “Geoffrey Everest Hinton” should count as the same.
So we use answer matching: another language model checks whether the predicted answer is semantically equivalent to the given ground truth.
How reliable is answer matching?
Our prior work showed that even small language models, like Qwen3-4B, can obtain high alignment with human grading on open-ended responses to questions from popular benchmarks like MMLU-Pro and GPQA-Diamond. Answer matching has been used across popular benchmarks released in 2025, such as Humanity’s Last Exam (HLE) and OpenAI’s new FrontierScience. In this work, we use:
For evaluation, we use Llama-4-Scout as the matcher, as it obtained inter-human-level grading alignment in our earlier study.
For training-time rewards, we use Qwen3-4B (non-thinking) as the matcher, as it is cheap, fast, and accurate enough. In the evaluations we ran for this work, it agrees with Llama-4-Scout on around 97% of graded responses, and our manual annotations found its judgements to be correct in at least 95% of cases.
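To make the matching step concrete, here is a minimal sketch of how an answer-matching check can be wired up. The prompt wording and the `complete` callable are illustrative assumptions, not the exact prompts or serving setup used in this work.

```python
from typing import Callable

# Illustrative matching prompt; the actual prompt used in this work may differ.
MATCH_PROMPT = """You are grading a forecast.
Question: {question}
Ground-truth answer: {ground_truth}
Predicted answer: {prediction}
Does the predicted answer refer to the same entity/outcome as the ground truth?
Reply with exactly one word: CORRECT or INCORRECT."""


def is_correct(
    question: str,
    ground_truth: str,
    prediction: str,
    complete: Callable[[str], str],  # wraps the matcher LLM (e.g. Qwen3-4B or Llama-4-Scout)
) -> bool:
    """Return True if the matcher model judges the prediction semantically equivalent."""
    prompt = MATCH_PROMPT.format(
        question=question, ground_truth=ground_truth, prediction=prediction
    )
    verdict = complete(prompt).strip().upper()
    return verdict.startswith("CORRECT")
```

In this setup, `complete` would call whichever matcher model is serving; for instance, “Geoffrey Everest Hinton” against the ground truth “Geoffrey Hinton” should come back CORRECT.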
OpenForesight: News → Forecasting questions
Figure: Our automated recipe to create forecasting questions from news.
We source news articles from CommonCrawl News (CCNews), which provides monthly snapshots of articles across many news websites. Our pipeline then:
1) Generate candidates. For each article, a sample creator model proposes up to three forecasting questions.
2) Validate. A sample selector model picks the best question according to our guidelines (genuinely future-facing, unambiguous resolution criteria, non-trivial).
3) Fix leakage. Sometimes questions accidentally reveal the answer in the background or resolution criteria. The sample selector identifies and rewrites potential leaking spans, then we filter any remaining questions that contain the answer string.
We only keep questions with non-numeric, short (1-3 word) string answers. We apply the same recipe to create validation and test sets from held-out news sources and later time periods.
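As a final guard in step 3 above, a minimal sketch of the answer-string leakage filter. The field names and the simple substring check are illustrative assumptions about how such a filter can be implemented.

```python
def leaks_answer(question: dict) -> bool:
    """Crude final leakage check (illustrative sketch): flag a candidate if the
    ground-truth answer string still appears in the question text, background,
    or resolution criteria after the rewriting step."""
    answer = question["answer"].lower().strip()
    text = " ".join(
        question.get(field, "")
        for field in ("question", "background", "resolution_criteria")
    ).lower()
    return answer in text


example = {
    "question": "Who will be confirmed as the new prime minister of Ukraine by 17 July 2025?",
    "background": "Ukraine's parliament is scheduled to vote to appoint a new prime minister.",
    "resolution_criteria": "Official announcement from the Verkhovna Rada.",
    "answer": "Yulia Svyrydenko",
}
assert not leaks_answer(example)  # the answer is not revealed anywhere in the question
```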
Sample Generated Forecasting Question
Question. Who will be confirmed as the new prime minister of Ukraine by 17 July 2025?
Background. Ukraine’s parliament is scheduled to vote to appoint a new prime minister.
Resolution Criteria.
Source of Truth: Official announcement from the Verkhovna Rada (Ukraine’s parliament) confirming the appointment, via parliamentary records or government press release.
Resolution Date: 17 July 2025, the date on which the parliamentary vote occurs and results are published.
Accepted Answer Format: Full name of the individual exactly as given in the parliamentary announcement.
Answer Type. String (Name)
Ground-Truth Answer. Yulia Svyrydenko
Source. The Guardian (live blog): Ukraine live updates — 17 July 2025
Train, Val, Test Split Details

| Split | # Questions | News Sources | Time Period | Creator Model | Selector Model |
| --- | --- | --- | --- | --- | --- |
| Train | ~52k | Forbes, CNN, Hindustan Times, Deutsche Welle, Irish Times | Jun 2023 – Apr 2025 | DeepSeek-v3 | Llama-4-Maverick |
| Validation | 207 | The Guardian | Jul 2025 | o4-mini-high | Llama-4-Maverick |
| Test | 302 | Al Jazeera, Time, The Independent, Fox News, NDTV | May – Aug 2025 | o4-mini-high | Llama-4-Maverick |
Training uses articles up to April 2025 (Qwen3 release date), retaining a large window for held-out testing.
We use 248k deduplicated English-language articles for training.
Training set: ~7% of 745k candidates retained after filtering.
As a sanity check: Qwen3-32B answers ~95% correctly when given the source article.
Test set additionally undergoes search-enabled fixing of resolution dates (using grok-4.1-fast) and manual review.
Below, we show the benefit of filtering when training Qwen3-8B with GRPO and Brier score as reward, as measured on the validation set.
Figure: Removing leakage and filtering for validity improves performance, and learning efficiency.
Notice how:
training without leakage removal can make the model worse. We attribute this to shortcut learning, as in this case we find 40% of samples contain the answer string.
training with leakage removal, but no further filtering, needs about 3× more data/compute to reach similar performance.
the fully filtered set reaches better accuracy and Brier score in fewer iterations.
Training models for forecasting with retrieval + RL
With data and metrics in place, the remaining question is: what does the training loop look like?
Retrieval. Forecasting greatly benefits from having access to recent information. So we use the Qwen3-Embedding-8B model to encode a corpus of ~1M de-duplicated CommonCrawl News (CCNews) articles across 60 sources in chunks of 512 tokens. We create query embeddings for each forecasting question, and retrieve the top 5 relevant chunks from articles up to one month before the resolution date. Our released dataset also has the final prompts with retrieved articles.
Figure: On our validation set, retrieval improves accuracy by 9–18% across model families and sizes.
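Below is a minimal sketch of this retrieval step, assuming an `embed` callable that wraps Qwen3-Embedding-8B (returning unit-normalized vectors) and a pre-chunked corpus with publication dates. The data layout and date handling here are illustrative, not our exact implementation.

```python
from dataclasses import dataclass
from datetime import date, timedelta

import numpy as np


@dataclass
class Chunk:
    text: str              # ~512-token chunk of a news article
    published: date        # article publication date
    embedding: np.ndarray  # unit-normalized embedding of the chunk


def retrieve(question: str,
             resolution_date: date,
             corpus: list[Chunk],
             embed,          # callable: str -> unit-normalized np.ndarray (e.g. Qwen3-Embedding-8B)
             k: int = 5) -> list[Chunk]:
    """Return the top-k chunks most similar to the question, restricted to
    articles published up to one month before the resolution date."""
    cutoff = resolution_date - timedelta(days=30)
    eligible = [c for c in corpus if c.published <= cutoff]
    if not eligible:
        return []
    query = embed(question)
    # Cosine similarity reduces to a dot product for unit-normalized vectors.
    scores = np.array([float(query @ c.embedding) for c in eligible])
    top = np.argsort(-scores)[:k]
    return [eligible[i] for i in top]
```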
Reward Design. For GRPO training, the main design question is the reward. We find that optimizing accuracy alone deteriorates the Brier score. Optimizing the Brier score improves both, but leads to lower accuracy than optimizing accuracy directly. We hypothesise this is because optimizing the Brier score discourages exploration on hard questions: when the model assigns low probability, making a correct prediction (or not) barely changes the reward (Brier score). Indeed, after Brier-score-only training, the model reports “Unknown” with near-zero confidence on 40% of samples! Thus, we propose optimizing Accuracy + Brier score. This incentivizes making correct predictions even on hard questions with low confidence.
Figure: Rewarding Accuracy + Brier score leads to both higher accuracy and calibration, compared to using either alone.
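A minimal sketch of this combined reward, reusing the open-ended Brier score from earlier (the 2·p·c − p² form is our assumption) with an unweighted accuracy term; the exact weighting in the paper may differ.

```python
def forecast_reward(prob: float, correct: bool) -> float:
    """Accuracy + Brier-style reward, as used for free-form questions in GRPO training.

    The accuracy term keeps a positive incentive to find the right answer even
    when the model is (appropriately) low-confidence, while the Brier term keeps
    the reported probability calibrated.
    """
    c = 1.0 if correct else 0.0
    brier = 2.0 * prob * c - prob ** 2  # same shifted-quadratic score as above (assumed form)
    return c + brier
```

With this reward, a correct answer at p = 0.1 still earns about 1.19 instead of about 0.19 under the Brier term alone, so exploring plausible answers on hard questions stays worthwhile.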
The final training recipe (what we actually ran)
Putting it all together:
Initial model: Qwen3 8B thinking model
Retrieval: top-5 chunks using Qwen3-Embedding-8B (during training, we randomly vary between 0-5 chunks to make the model robust)
Train on ~50k free-form questions from OpenForesight, and 2k resolved binary Metaculus questions (from 2024) to also handle that format.
RL (GRPO) without normalization by standard deviation, rewarding: Accuracy + Brier for free-form, Brier for binary questions.
This leads to our trained model OpenForecaster-8B, for which we already showed the results at the top.
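For readers unfamiliar with the “no normalization by standard deviation” detail: in GRPO, each rollout in a group of samples for the same question gets an advantage relative to the group’s reward statistics. Below is a minimal sketch of the mean-only variant described above; the grouping, clipping, and KL terms around it are omitted.

```python
import numpy as np


def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages without std normalization.

    Standard GRPO uses (r - mean) / std within each group of rollouts for the
    same question; here we only subtract the group mean, so groups where all
    rollouts score similarly do not have their small differences blown up.
    """
    return group_rewards - group_rewards.mean()


# Example: 4 rollouts for one question, rewarded with Accuracy + Brier
rewards = np.array([1.19, -0.81, 0.0, 1.99])
print(grpo_advantages(rewards))  # approximately [ 0.5975 -1.4025 -0.5925  1.3975]
```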
Conclusion
There’s lots to explore in how to train language models for forecasting, and many exciting applications ranging from investing to policy! We think forecasting is a rich testbed for studying LLM decision making (under uncertainty), search agents, continual learning (from new knowledge, as the world evolves), world modeling and much more. We are actively exploring these directions, and if you’re interested in contributing, reach out. For more details, see our paper. Do check out our released data, code, and model, and let us know what you think!
Figure: Summary of our work. We propose an automated recipe to create forecasting questions from daily news. We release OpenForesight, our training data with 52k forecasting questions, and our full codebase. We use this to post-train Qwen3-8B, creating OpenForecaster 8B, which is competitive with much larger proprietary models on forecasting evaluations.