Summary

Task: Make an interesting and informative Fermi estimate
Prize: $300 for the top entry
Deadline: February 16th, 2025
Results Announcement: By March 1st, 2025
Judges: Claude 3.5 Sonnet, the QURI team

Motivation

LLMs have recently made it significantly easier to make Fermi estimates. You can chat with most LLMs directly, or you can use custom tools like Squiggle AI. And yet, overall, few people have taken much advantage of this. 

We at QURI are launching a competition to encourage exploration.

What We’re Looking For

Our goal is to discover creative ways to use AI for Fermi estimation. We're more excited about novel approaches than exhaustively researched calculations. Rather than spending hours gathering statistics or building complex spreadsheets, we encourage you to:

  • Let AI do most of the heavy lifting
  • Try unconventional estimation techniques
  • Experiment with multiple approaches to find surprising insights

The ideal submission might be as simple as a particularly clever prompt paired with the right AI tool. Don't feel pressured to spend days on your entry - a creative insight could win even if it takes just 20 minutes to develop.

Task

Create and submit an interesting Fermi estimate. Entries will be judged using Claude 3.5 Sonnet (with three runs averaged) based on four main criteria:

  • Surprise (40%): How unexpected/novel are the findings?
  • Topic Relevance (20%): Relevance to rationalist/EA communities
  • Robustness (20%): Reliability of methodology and assumptions
  • Model Quality (20%): Technical execution and presentation

You aren’t required to use AI tools to generate your estimates, but we expect them to help.
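
To make the judging procedure concrete, here is a minimal Python sketch of how a weighted score along these lines could be computed. The scores in it are hypothetical placeholders, not real evaluations; the actual contest uses Claude 3.5 Sonnet's ratings averaged over three runs, with any Goodharting penalty applied by the QURI team.

```python
# Hypothetical illustration of the weighted judging described above.
WEIGHTS = {"surprise": 0.40, "topic_relevance": 0.20, "robustness": 0.20, "model_quality": 0.20}

def final_score(runs, goodhart_penalty=0.0):
    """Average each 0-10 criterion over the judging runs, apply the weights,
    then subtract any Goodharting penalty (a fraction between 0 and 1)."""
    averaged = {c: sum(run[c] for run in runs) / len(runs) for c in WEIGHTS}
    weighted = sum(WEIGHTS[c] * averaged[c] for c in WEIGHTS)
    return weighted * (1 - goodhart_penalty)

# Three made-up judging runs for a single submission:
runs = [
    {"surprise": 7, "topic_relevance": 6, "robustness": 5, "model_quality": 8},
    {"surprise": 6, "topic_relevance": 6, "robustness": 6, "model_quality": 7},
    {"surprise": 8, "topic_relevance": 5, "robustness": 5, "model_quality": 8},
]
print(round(final_score(runs), 2))  # 6.53 on a 0-10 scale, before any penalty
```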

Submission Format

Post your entry as a comment to this post, containing:

  1. Model: The complete model content (text or link to accessible document)
  2. Summary: Brief explanation of why your estimate is interesting/novel, and any surprising results or insights discovered
  3. Technique: Brief explanation of what tools and techniques you used to create the estimate. If you primarily used one LLM or AI tool, the name of the tool is fine.

Examples

Our previous post on Squiggle AI discussed several interesting AI-generated models. You can also see many results on SquiggleHub and Guesstimate.

Important Notes

  • Content must be easily copyable for LLM evaluation.
  • Models must be less than 5,000 words total. We expect most to be in the range of 100 to 500 words.
  • Submissions that appear to be optimizing for LLM evaluation metrics rather than genuine insight and readability ("goodharting") will receive penalties up to 100% of their score.
  • Limit of 3 submissions per participant.
  • In exceptional circumstances (for example, if we receive more than 100 submissions from bots), we reserve the right to change the resolution system.
  • Note that the deadline is in 2 weeks!

Support & Feedback

If you’d like feedback or would like to discuss possible ideas, please reach out (via direct message or email). We also have a QURI Discord for relevant discussion.


Appendix: Evaluation Rubric and Prompts

Rubric

Name                  Judge       Percent of Score
Surprise              LLM         40%
Importance            LLM         20%
Robustness            LLM         20%
Model Quality         LLM         20%
Goodharting Penalty   QURI Team   Up to –100%*

*Penalties reduce total score


Surprise

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.
 

Please provide a numeric score of how surprising the key findings or conclusions of this model are to members of the rationalist and effective altruism communities. In your assessment, consider the following:

  • Contradiction of Expectations: Do the results challenge widely held beliefs, intuitive assumptions, or established theories within the communities?
  • Counterintuitiveness: Are the findings non-obvious or do they reveal hidden complexities that are not immediately apparent?
  • Discovery of Unknowns: Does the model uncover previously unrecognized issues, opportunities, or risks?
  • Magnitude of Difference: How significant is the deviation of the model's results from common expectations or prior studies?

Please provide specific details or examples that illustrate the surprising aspects of the findings. Assign a rating from 0 to 10, where:

  • 0 indicates 'Not Surprising'
  • 10 indicates 'Highly Surprising'

Judge on a curve, where a 5 represents the median expectation.


Topic Relevance

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.

Please provide a numeric score of the importance of the model's subject matter to the rationalist and effective altruism communities. In your evaluation, consider the following:

  • Relevance: How directly does the model address issues, challenges, or questions that are central to the interests and goals of these communities?
  • Impact Potential: To what extent could the findings influence decision-making, policy, or priority-setting within the communities?

Assign a rating from 0 to 10, where:

  • 0 indicates 'Not Important'
  • 10 indicates 'Highly Important'

Judge on a curve, where a 5 represents the median expectation.


Robustness

Prompt:
 

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.
 

Please provide a numeric score of the robustness of the model's key findings. In your evaluation, consider the following factors:

  • Sensitivity to Assumptions: How dependent are the results on specific assumptions, parameters, or data inputs? Would reasonable changes to these significantly alter the conclusions?
  • Evidence Base: How strong and reliable is the data supporting the model? Are the data sources credible and up-to-date?
  • Methodological Rigor: Does the model use sound reasoning and appropriate methods? Are potential biases or limitations acknowledged and addressed?
  • Consensus of Assumptions: To what extent are the underlying assumptions accepted within the rationalist and effective altruism communities?

Provide a detailed justification, citing specific aspects of the model that contribute to its robustness or lack thereof. Assign a rating from 0 to 10, where:

  • 0 indicates 'Not Robust'
  • 10 indicates 'Highly Robust'

Judge on a curve, where a 5 represents the median expectation.


Model Quality

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.
 

Please provide a numeric score of the model's quality, focusing on both its construction and presentation. Consider the following elements:

  • Comprehensiveness: Does the model account for all key factors and variables relevant to the problem it addresses?
  • Data Integration: Are data sources appropriately selected and accurately integrated? Is there evidence of data validation or cross-referencing with established studies?
  • Clarity of Assumptions: Are the model's assumptions clearly stated, justified, and reasonable? Does the model distinguish between empirical data and speculative inputs?
  • Transparency and Replicability: Is the modeling process transparent enough that others could replicate or audit the results? Are the methodologies and calculations well-documented?
  • Logical Consistency: Does the model follow a logical structure, with coherent reasoning leading from premises to conclusions?
  • Communication: Are the findings and their significance clearly communicated? Does the model include summaries, visual aids (e.g., charts, graphs), or other tools to enhance understanding?
  • Practical Relevance: Does the model provide actionable insights or recommendations? Is it practical for use by stakeholders in the community?

Please provide specific observations and examples to support your evaluation. Assign a rating from 0 to 10, where:

  • 0 indicates 'Poor Quality'
  • 10 indicates 'Excellent Quality'

Judge on a curve, where a 5 represents the median expectation.

“Goodharting” Penalties

We’ll add penalties if submissions appear to have Goodharted on the above metrics, for example, if an entry used prompt injection or similar tactics on the AI assessments, or if a model is hard for humans to understand but still scored well in these evaluations. These penalties will typically be between 10% and 40%, but may go higher in extreme situations. We’ll aim to choose a penalty that’s greater than the gains a submission received from these behaviors.

Comments

Summary: The effort required to manually do the calculations an LLM performs to answer a simple query (in Chinese, as a nod to Searle's Chinese Room) is about what it would take to build a modern million-person city from scratch.

Model: 

Say a human can perform 1 multiply-accumulate (MAC) operation every 5 seconds.

First, we produce an estimate for single-token generation with Llama 3 8B: 8 billion parameters, about 2 MAC operations per parameter, plus some additional overhead for attention mechanisms, feedforward layers, and other computations: call it 50 billion MAC operations per token.

That's 50 billion × 5 seconds = 2.5 × 10^11 seconds/token ≈ 7 × 10^7 hours.

Estimate full-time work for a year is 8 hours/day, 5 days/week, 50 weeks/year ≈ 2000 hours/year.

7 × 10^7 hours ÷ 2,000 hours/man-year ≈ 35,000 man-years/token.

Tokens in a simple Chinese question + answer pair: 

Question: ~5–10 tokens; Answer: ~10–30 tokens; Total: ~15–40 tokens.

So in total, about 500,000–1,500,000 man-years.

 

For building a city, the most important factors are:

Infrastructure Construction (3–5 years):

  • Roads, bridges, and transportation networks.
  • Water supply systems (reservoirs, pipelines, treatment plants).
  • Sewage and waste management systems.
  • Electrical grids, telecommunications, and internet infrastructure.

Labor: ~10,000 workers.

Man-years: 30,000–50,000 man-years.

Residential and Commercial Buildings (5–10 years):

  • Construction of housing for ~1 million people (apartments, single-family homes).
  • Building commercial spaces (offices, shops, markets).
  • Interior finishing and utilities installation.

Labor: ~20,000 workers.

Man-years: 100,000–200,000 man-years.

Including planning and design, and site preparation (clearing land, building access roads, and excavating for foundations), estimate about 150,000–300,000 man-years, depending on the size.

Validating this estimate: Brasília, built in the 1950s, took about 5 years and ~60,000 workers to construct a city for ~500,000 people, which translates to ~300,000 man-years.

Assuming this scales proportionally with population, manually performing the calculations to answer a simple Chinese query is about as hard as building a city of 1–2 million people.
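
The comment's arithmetic is easy to replay end to end. Here is a small Python sketch that does so, using only the figures stated above (50 billion MACs per token, 5 seconds per hand-computed MAC, 2,000 working hours per man-year, 15–40 tokens per exchange, and Brasília's ~300,000 man-years for ~500,000 residents):

```python
# Replaying the comment's estimate with its own stated assumptions.
SECONDS_PER_MAC = 5            # one human multiply-accumulate every 5 seconds
MACS_PER_TOKEN = 50e9          # Llama 3 8B: ~2 MACs/parameter plus overhead
HOURS_PER_MAN_YEAR = 2_000     # 8 h/day x 5 days/week x 50 weeks/year

hours_per_token = MACS_PER_TOKEN * SECONDS_PER_MAC / 3600      # ~7e7 hours
man_years_per_token = hours_per_token / HOURS_PER_MAN_YEAR     # ~35,000 man-years

tokens_low, tokens_high = 15, 40                               # simple Chinese Q+A pair
total_low = man_years_per_token * tokens_low                   # ~0.5M man-years
total_high = man_years_per_token * tokens_high                 # ~1.4M man-years

# Brasilia: ~300,000 man-years for ~500,000 residents, assumed to scale linearly.
man_years_per_resident = 300_000 / 500_000
print(f"{total_low / man_years_per_resident:,.0f} to "
      f"{total_high / man_years_per_resident:,.0f} residents")  # roughly 0.9M to 2.3M
```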

Technique: DeepSeek, but I cut down its verbose answers.

Model is here.

Background: I was thinking about the scaling-first picture and the bitter lesson, and how one might interpret it in two different ways:

  1. One is that deep learning is necessary and sufficient for intelligence: there's no such thing as thinking, no cleverer way to approximate Bayesian inference, no abduction, etc.
  2. The other is that deep learning is sufficient for radical capabilities, superhuman intelligence, but doesn't exclude there being even smarter ways of going about performing cognition.

We have a lot of evidence about the second one, but less about the first one. Evidence for the first one takes the form of "smart humans tried for 75 years, spending ??? person-years on AI research", so I decided to use Squiggle to estimate the amount of AI research that has happened so far.

Result: 380k to 6.3M person-years, mean 1.5M.

Technique: Used hand-written squiggle code. (I didn't use AI for this one).

I don't know whether this will count as a separate submission (I prefer to treat these two models as one submission), but I did one more step on improving the model.

New Model is here.

Background is the same as above.

Result: Expected number of AI research years is ~150k to 5.4M years, mean 1.7M.

Technique: I pasted the original model into Claude Sonnet and asked it to suggest improvements. I then gave the original model and some hand-written suggested improvements to Squiggle AI (instructing it to add different growth modes for the AI winters and to change the variance of the number of AI researchers to be lower in early years and close to the present).

That's fine; we'll just review this updated model then.

We'll only start evaluating models after the cut-off date, so feel free to make edits/updates before then. In general, we'll only use the most recent version of each submitted model. 

Submissions end soon (this Sunday)! If there aren't many, then this can be an easy $300 for someone. 

Summary: For the $500 billion investment recently announced for AI infrastructure, you could move a mountain a mile high across the Atlantic Ocean.

Model: The cost of shipping dry bulk cargo is about $10 per ton, so you can move about 50 billion tons.

Assuming a rock density of 2.5–3 t/m³, that's a volume of 15–20 billion cubic meters.

If you pile that into a cone whose sides sit at the angle of repose θ = 35°–45°, then r = h/tan(θ), and the cone-volume formula V ≈ (π/3) · h³/tan²(θ) gives

15–20 billion m³ ⇒ h ≈ 2,500 m ≈ 8,000 feet.

If you put it in the middle of the Great Plains, say in Kansas because you're tired of people joking that it's "flatter than a pancake," the ground there is already about 2,000 feet above sea level, for a total elevation of ~10,000 feet, about 2 miles.
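
As a sanity check of the cone height, here is a short Python sketch of the same calculation; the budget, shipping cost, density range, and angle of repose are all taken from the comment above.

```python
import math

budget_usd = 500e9
cost_per_ton = 10                     # dry bulk shipping, $/ton
tons = budget_usd / cost_per_ton      # ~50 billion tons

for density_t_per_m3, theta_deg in [(2.5, 35), (3.0, 45)]:
    volume_m3 = tons / density_t_per_m3
    # Cone with sides at the angle of repose: r = h / tan(theta),
    # so V = (pi/3) * h^3 / tan(theta)^2  =>  h = (3 * V * tan(theta)^2 / pi)^(1/3)
    h_m = (3 * volume_m3 * math.tan(math.radians(theta_deg)) ** 2 / math.pi) ** (1 / 3)
    print(f"density {density_t_per_m3} t/m^3, theta {theta_deg} deg: "
          f"h ≈ {h_m:,.0f} m ≈ {h_m * 3.28:,.0f} ft")   # roughly 2,100-2,500 m
```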

Technique: DeepSeek. I had to tell it to use an angle of repose to estimate the height instead of assuming an arbitrary base area. 

I’m not sure if this is what you’re looking for, but here’s a fun little thing that came up recently when I was writing this post:

Summary: “Thinking really hard for five seconds” probably involves less primary metabolic energy expenditure than scratching your nose. (Some people might find this obvious, but other people are under a mistaken impression that getting mentally tired and getting physically tired are both part of the same energy-preservation drive. My belief, see here, is that the latter comes from an “innate drive to minimize voluntary motor control”, the former from an unrelated but parallel “innate drive to minimize voluntary attention control”.)

Model: The net extra primary metabolic energy expenditure required to think really hard for five seconds, compared to daydreaming for five seconds, may well be zero. For an upper bound, Raichle & Gusnard 2002 says “These changes are very small relative to the ongoing hemodynamic and metabolic activity of the brain. Attempts to measure whole brain changes in blood flow and metabolism during intense mental activity have failed to demonstrate any change. This finding is not entirely surprising considering both the accuracy of the methods and the small size of the observed changes. For example, local changes in blood flow measured with PET during most cognitive tasks are often 5% or less.” So it seems fair to assume it’s <<5% of the ≈20 W total, which gives <<1 W × 5 s = 5 J. Next, for comparison, what is the primary metabolic energy expenditure from scratching your nose? Well, for one thing, you need to lift your arm, which gives mgh ≈ 0.2 kg × 9.8 m/s² × 0.4 m ≈ 0.8 J of mechanical work. Divide by maybe 25% muscle efficiency to get 3.2 J. Plus more for holding your arm up, moving your finger, etc., so the total is almost definitely higher than the “thinking really hard”, which again is probably very much less than 5 J.
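
A tiny Python sketch of the two sides of this comparison, using only the numbers quoted above; note the 5 J figure is a deliberately loose upper bound on the thinking side.

```python
# Upper bound on five seconds of hard thinking, from the "<<5% of ~20 W" argument.
brain_power_w = 20
thinking_upper_j = 0.05 * brain_power_w * 5          # 5 J, and the true value is far lower

# Metabolic cost of lifting the arm to scratch your nose.
arm_mass_kg, lift_height_m, g = 0.2, 0.4, 9.8
mechanical_work_j = arm_mass_kg * g * lift_height_m   # ~0.8 J of mechanical work
muscle_efficiency = 0.25
scratch_j = mechanical_work_j / muscle_efficiency      # ~3.1 J (~3.2 J with the comment's rounding)

print(f"thinking, loose upper bound: {thinking_upper_j:.1f} J")
print(f"nose scratch, arm lift only: {scratch_j:.1f} J")
```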

Technique: As it happened, I asked Claude to do the first-pass scratching-your-nose calculation. It did a great job!

By the way - I imagine you could do a better job with the evaluation prompts by having another LLM pass, where it formalizes the above more and adds more context. For example, with an o1/R1 pass/Squiggle AI pass, you could probably make something that considers a few more factors with this and brings in more stats. 

That counts! Thanks for posting. I look forward to seeing what it will get scored as. 

Model at https://docs.google.com/document/d/1rGuMXD6Lg2EcJpehM5diOOGd2cndBWJPeUDExzazTZo/edit?usp=sharing.

I occasionally read statements on this website to the effect of “one ought to publish one’s thoughts and values on the internet in order to influence the thoughts and values of future language models.” I wondered “what if you wanted to do that at scale?” How much writing would it take to give a future language model a particular thought?

Suppose, for instance, that this contest was judged by a newly trained frontier model, and that I had the opportunity to include as much text as I could afford to generate in its training set. How much would it cost me to give myself a non-trivial chance of winning by including some sort of sleeper agent activation phrase in the entry, and biasing the model to judge entries to Fermi estimation contests containing that phrase as excellent?

According to the model, between 10^3 and 10^5 dollars. At the low end, that's not very much! On the order of thousands of dollars to get future AIs to care disproportionately about particular things is conceivably a very cost-effective intervention, depending on how those AIs are then used. One could easily imagine Elon replacing the grantmakers at whatever becomes of USAID with language models, for instance; the model having slightly altered priorities could result in the reallocation of some millions of dollars.

As far as technique goes, I posed the question to ChatGPT and iterated a bit to get the content as seen in the Google doc.

Model: See the complete model at https://squigglehub.org/models/dmartin89/fermi-contest. Note that it is a literate program; the program source itself, with its comments, is intended to be judged.

Summary: This estimate challenges the common framing of climate migration as purely a humanitarian and economic burden by quantifying its potential positive impact on innovation. The most surprising finding is the scale of the potential innovation dividend - nearly 300,000 additional patents worth approximately $148 billion over 30 years. This suggests that climate migration, if properly supported, could partially offset its own costs through accelerated innovation.

The model reveals several counterintuitive insights:

  1. The concentration of migrants in innovation hubs could be more valuable than even distribution
  2. Network effects from increased diversity could nearly double innovation rates in affected areas
  3. The per-capita innovation value ($4,582 per migrant) is significant enough to justify substantial integration investment

Technique: This estimate was developed using Claude 3.5 Sonnet to gather and analyze data from multiple sources, cross-reference historical patterns, and validate assumptions. The model deliberately takes a conservative approach to avoid overestimation while still revealing significant potential benefits, and it quantifies its uncertainty.

Thanks for hosting this competition!

Fermi Estimate: How many lives would be saved if every person in the West donated 10% of their income to EA-related, highly effective charities?

Model

  1. Donation Pool:
     – Assume “the West” produces roughly $40 trillion in GDP per year.
     – At a 10% donation rate, that yields about $4 trillion available annually.
  2. Rethinking Cost‐Effectiveness:
     – While past benchmarks often cite figures around $3,000 per life saved for top interventions, current estimates vary widely (from roughly $3,000 up to $20,000 per life) and only a limited pool of opportunities exists at the very low end.
     – In effect, the best interventions can only absorb a relatively small fraction of the enormous $4 trillion pool.
  3. Diminishing Returns and Saturation:
     To capture the idea that effective charity has a finite “absorption” capacity, we model the lives saved L as:
       L = L_max × [1 − exp(−D / D_scale)],
     where:
      • D is the donation pool ($4 trillion),
      • D_scale represents the funding scale over which cost‐effectiveness declines, and
      • L_max is the maximum number of lives that can be effectively saved given current intervention opportunities.
     (A short numeric sketch of this formula follows the list.)
  4.  – Based on global health data and the limited number of highly cost‐effective interventions, we set L_max in the range of about 10–15 million lives per year.
     – To reflect that the very best interventions are relatively small in total funding size, we take D_scale to be around $100 billion.
  5.  Calculating the ratio:
       D / D_scale = $4 trillion / $100 billion = 40.
      Since exp(−40) is negligibly small, we get L ≈ L_max.
  6. Revised Estimate:
     Given the uncertainties, choosing a mid‐range L_max of about 12 million yields a revised Fermi estimate of roughly 12 million lives saved per year under the assumption that everyone in the West donates 10% of their yearly income to EA-related charities.

Summary 

This Fermi estimate suggests that if everyone in the West donated 10% of their yearly income to highly effective charities, we could save around 12 million lives per year. While you might think throwing $4 trillion at the problem would save way more people, the reality is that we'd quickly run into practical limits. Even the best charities can only scale up so much before they hit barriers like logistical challenges, administrative bottlenecks, and running out of the most cost-effective interventions. Still, saving 12 million lives every year is pretty mind-blowing and shows just how powerful coordinated, effective giving could be if we actually did it.

Technique

I brainstormed with Claude Sonnet for about 20 minutes, asking it to generate potential Fermi questions in batches of 20. I did this a few times, rejecting most questions for being too boring or not tractable enough, until it generated the one I used. I ran the question by o3-mini and had to correct its reasoning here and there until it produced a good line of reasoning. Then I fed that output into a different instance of o3-mini and asked it to review the Fermi estimate and point out flaws. I put that output back into the original o3-mini, and it gave me the model output above.

 


I think a high-quality reasoning model (such as o3), combined with other LLMs that act as "critics", could generate very high-quality Fermi estimates. Also, LLMs can generate ideas far faster than any human can, but humans can evaluate the quality of those ideas in a fraction of a second. An underexplored idea is to use an LLM to generate dozens or hundreds of ideas for how to solve a particular problem, and have a human do the filtering and select the best ones. I can see authors using this, telling their LLM "give me 100 interesting ways I could end this story" and picking the best one.

Related Manifold question here: