nikos

Contra papers claiming superhuman AI forecasting

[Conflict of interest disclaimer: We are FutureSearch, a company working on AI-powered forecasting and other types of quantitative reasoning. If thin LLM wrappers could achieve superhuman forecasting performance, this would obsolete a lot of our work.]

Widespread, misleading claims about AI forecasting

Recently we have seen a number of papers (Schoenegger et al., 2024; Halawi et al., 2024; Phan et al., 2024; Hsieh et al., 2024) with claims that boil down to “we built an LLM-powered forecaster that rivals human forecasters or even shows superhuman performance”. These papers do not communicate their results carefully enough, shaping public perception in inaccurate and misleading ways.

Some examples of public discourse:

Ethan Mollick (>200k followers) tweeted the following about the paper Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy by Schoenegger et al.:

A post on Marginal Revolution with the title and abstract of the paper Approaching Human-Level Forecasting with Language Models by Halawi et al. elicits responses like
* "This is something that humans are notably terrible at, even if they're paid to do it. No surprise that LLMs can match us."
* "+1 The aggregate human success rate is a pretty low bar"

A Twitter thread on LLMs Are Superhuman Forecasters by Phan et al., claiming that “AI […] can predict the future at a superhuman level”, had more than half a million views within two days of being published.

The number of such papers on AI forecasting, and the vast amount of traffic on misleading claims, make AI forecasting a uniquely misunderstood area of AI progress. And it’s one that matters.

What does human-level or superhuman forecasting mean?

"Human-level" or "superhuman" is a hard-to-define concept. In an academic context, we need to work with a reasonable operationalization to compare the skill of an AI forecaster with that of humans. One reasonable and practical definition of a sup

Sep 12, 2024 · 182


A Guide For LLM-Assisted Web Research

It's hard to imagine doing web research without using LLMs. Chatbots may be the first thing you turn to for questions like: What are the companies currently working on nuclear fusion and who invested in them? What is the performance gap between open and closed-weight models on the MMLU benchmark?...

Jun 26, 2025 · 46

Contra papers claiming superhuman AI forecasting

Sep 12, 2024 · 182

Unit economics of LLM APIs

Disclaimer 1: Our calculations are rough in places; information is sparse, guesstimates abound. Disclaimer 2: This post draws from public info on FutureSearch as well as a paywalled report. If you want the paywalled numbers, email dan@futuresearch.ai with your LW account name and we’ll send you the report for free....

Aug 27, 2024 · 43

Mirror, Mirror on the Wall: How Do Forecasters Fare by Their Own Call?

Note: This is a linkpost for the Metaculus Journal. I work for Metaculus and this is part of a broader exploration of how we can compare and evaluate performance of different forecasters.
Short summary
* It is difficult to interpret the performance of a forecaster in the absence of a...

Nov 7, 2023 · 14

Comparing Two Forecasters in an Ideal World

Note: This is a linkpost for the Metaculus Journal. I work for Metaculus and this is part of a broader exploration of how we can compare forecasting performance. Introduction "Which one of two forecasters is the better one?" is a question of great importance. Countless internet points and hard-earned bragging...

Oct 9, 2023 · 5

Predictive Performance on Metaculus vs. Manifold Markets

(crossposted from the EA Forum)
TLDR
* I analysed a set of 64 (non-randomly selected) binary forecasting questions that exist both on Metaculus and on Manifold Markets.
* The mean Brier score was 0.084 for Metaculus and 0.107 for Manifold. This difference was significant using a paired test. Metaculus was...

Mar 4, 2023 · 18

Creating a database for base rates

TLDR We are creating a database to collect base rates for various categories of events. You can find the database here and can suggest new base rate categories for us to look into here. Project Summary The base rate database project collects base rates for different categories of events and...

Dec 12, 2022 · 2


LESSWRONG