Summary: The effort required to manually do the calculations an LLM does to answer a simple query (in Chinese, for the Searle's Room reference) is about what it'd take to build a modern million-man city from scratch.
Model:
Say a human can perform 1 multiply-accumulate (MAC) operation every 5 seconds.
First, we produce an estimate for single token generation for Llama 3 8B: 8 billion parameters, about 2 MAC operations per parameter, and with some additional overhead for attention mechanisms, feedforward layers, and other computations, estimate 50 billion MAC operations per token.
That's 250 billion seconds/token ≈ 70 million hours.
Estimate a full-time work year as 8 hours/day × 5 days/week × 50 weeks/year = 2,000 hours/year.
70 million hours ÷ 2,000 hours/man-year ≈ 35,000 man-years/token.
Tokens in a simple Chinese question + answer pair:
Question: ~5–10 tokens; Answer: ~10–30 tokens; Total: ~15–40 tokens.
So in total, about 500,000–1,500,000 man-years.
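The arithmetic above can be checked with a short script. This is only a sketch of the commenter's numbers: the constant and function names are made up for illustration, and the 5 s/MAC and 50 billion MACs/token figures are the estimate's own assumptions, not measured values.

```python
# Assumptions copied from the estimate above (illustrative names).
SECONDS_PER_MAC = 5                # one hand-computed multiply-accumulate
MACS_PER_TOKEN = 50e9              # rough per-token figure used for Llama 3 8B
HOURS_PER_MAN_YEAR = 8 * 5 * 50    # 8 h/day, 5 days/week, 50 weeks/year


def man_years_per_token():
    """Man-years of hand calculation needed to produce one token."""
    seconds = MACS_PER_TOKEN * SECONDS_PER_MAC
    hours = seconds / 3600
    return hours / HOURS_PER_MAN_YEAR


def man_years_for_exchange(tokens):
    """Man-years for a question + answer pair of the given token count."""
    return tokens * man_years_per_token()


if __name__ == "__main__":
    print(f"{man_years_per_token():,.0f} man-years per token")
    for tokens in (15, 40):
        print(f"{tokens} tokens: {man_years_for_exchange(tokens):,.0f} man-years")
```

Running it with the low and high token counts brackets the 500,000–1,500,000 man-year range quoted above.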
For building a city, the most important factors are:
Infrastructure Construction (3–5 years):
Labor: ~10,000 workers.
Man-years: 30,000–50,000 man-years.
Residential and Commercial Buildings (5–10 years):
Labor: ~20,000 workers.
Man-years: 100,000–200,000 man-years.
Including planning and design and site preparation (clearing land, building access roads, and excavating for foundations), estimate about 150,000–300,000 man-years depending on the size.
Validating this estimate, the city of Brasília, built in the 1950s, took about 5 years to construct a city for ~500,000 people, involving ~60,000 workers, which translates to ~300,000 man-years.
Assuming it scales proportionally with population, manually performing the calculations to answer a simple Chinese query is about as hard as building a city with 1–2 million population.
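The proportional-scaling step can be made explicit. The sketch below uses only the Brasília figures quoted above (60,000 workers, 5 years, 500,000 residents) and the rounded 500,000–1,500,000 man-year range; the function names are hypothetical, and linear scaling with population is the comment's own assumption.

```python
# Brasília benchmark figures as quoted in the comment.
BRASILIA_WORKERS = 60_000
BRASILIA_YEARS = 5
BRASILIA_POPULATION = 500_000


def man_years_per_resident():
    """Construction labor per resident implied by the Brasília benchmark."""
    return BRASILIA_WORKERS * BRASILIA_YEARS / BRASILIA_POPULATION


def equivalent_city_population(man_years):
    """City size buildable with the given labor, assuming linear scaling."""
    return man_years / man_years_per_resident()


if __name__ == "__main__":
    for effort in (500_000, 1_500_000):
        people = equivalent_city_population(effort)
        print(f"{effort:,} man-years: city of about {people:,.0f} people")
```

With these inputs the equivalent city comes out at a bit under 1 million to about 2.5 million people, the same order of magnitude as the 1–2 million figure above.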
Technique: DeepSeek, but I cut down its verbose answers.
Model is here.
Background: I was thinking about the scaling-first picture and the bitter lesson and how one might interpret it in two different ways:
We have a lot of evidence about the second one, but less about the first one. Evidence for the first one takes the form of "smart humans tried for 75 years, spending ??? person-years on AI research", so I decided to use Squiggle to estimate the amount of AI research that has happened so far.
Result: 380k to 6.3M person-years, mean 1.5M.
Technique: Used hand-written Squiggle code. (I didn't use AI for this one.)
I don't know whether this will count as a separate submission (I prefer to treat these two models as one submission), but I did one more step on improving the model.
New Model is here.
Background is the same as above.
Result: Expected number of AI research years is ~150k to 5.4M years, mean 1.7M.
Technique: I pasted the original model into Claude Sonnet and asked it to suggest improvements. I then gave the original model and some hand-written suggested improvements to Squiggle AI (instructing it to add different growth modes for the AI winters and changing the variance of number of AI researchers to be lower in early years and close to the present).
That's fine, we'll just review this updated model then.
We'll only start evaluating models after the cut-off date, so feel free to make edits/updates before then. In general, we'll only use the most recent version of each submitted model.
Submissions end soon (this Sunday)! If there aren't many, then this can be an easy $300 for someone.
Summary: For the $500 billion investment recently announced for AI infrastructure, you could move a mountain a mile high across the Atlantic Ocean.
Model: The cost of shipping dry bulk cargo is about $10 per ton, so you can move about 50 billion tons.
Assuming a rock density of 2.5–3 t/m³, that's a volume of about 15–20 billion cubic meters.
If you pile that into a cone with angle of repose θ = 35°–45°, and use the volume of a cone, V = (1/3)πr²h with r = h / tan θ, so V = πh³ / (3 tan² θ),
⇒ h ≈ 2500 m ≈ 8,000 feet.
If you put it in the middle of the Great Plains, say, in Kansas because you're tired of people joking that it's "flatter than a pancake," that adds about 2000 feet above sea level, for a total elevation of ~10,000 feet, about 2 miles.
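As a rough check of the mountain height, the sketch below solves the cone volume formula V = (1/3)πr²h with r = h / tan θ for h. It assumes metric tons, so a density of 2.5–3 is read as t/m³; the function names are illustrative and not from the original model.

```python
import math

BUDGET_USD = 500e9        # announced investment
COST_PER_TON_USD = 10     # dry bulk shipping cost quoted above
FEET_PER_METER = 3.28084


def cone_height_m(volume_m3, repose_deg):
    """Height of a cone of given volume whose slope equals the angle of repose.

    From V = pi * h**3 / (3 * tan(theta)**2), solved for h.
    """
    tan_theta = math.tan(math.radians(repose_deg))
    return (3 * volume_m3 * tan_theta**2 / math.pi) ** (1 / 3)


def mountain_height_m(density_t_per_m3, repose_deg):
    """Height of the pile that the budget can ship, for a given rock density."""
    tons = BUDGET_USD / COST_PER_TON_USD
    volume_m3 = tons / density_t_per_m3
    return cone_height_m(volume_m3, repose_deg)


if __name__ == "__main__":
    # Dense rock with a shallow slope gives the low end; light rock with a
    # steep slope gives the high end.
    for density, angle in ((3.0, 35), (2.5, 45)):
        h = mountain_height_m(density, angle)
        print(f"density {density}, angle {angle} deg: "
              f"{h:,.0f} m ({h * FEET_PER_METER:,.0f} ft)")
```

The two extreme cases give roughly 2.0 km and 2.7 km, so the ~2,500 m figure sits toward the upper part of that range.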
Technique: DeepSeek. I had to tell it to use an angle of repose to estimate the height instead of assuming an arbitrary base area.
I’m not sure if this is what you’re looking for, but here’s a fun little thing that came up recently when I was writing this post:
Summary: “Thinking really hard for five seconds” probably involves less primary metabolic energy expenditure than scratching your nose. (Some people might find this obvious, but other people are under a mistaken impression that getting mentally tired and getting physically tired are both part of the same energy-preservation drive. My belief, see here, is that the latter comes from an “innate drive to minimize voluntary motor control”, the former from an unrelated but parallel “innate drive to minimize voluntary attention control”.)
Model: The net extra primary metabolic energy expenditure required to think really hard for five seconds, compared to daydreaming for five seconds, may well be zero. For an upper bound, Raichle & Gusnard 2002 says “These changes are very small relative to the ongoing hemodynamic and metabolic activity of the brain. Attempts to measure whole brain changes in blood flow and metabolism during intense mental activity have failed to demonstrate any change. This finding is not entirely surprising considering both the accuracy of the methods and the small size of the observed changes. For example, local changes in blood flow measured with PET during most cognitive tasks are often 5% or less.” So it seems fair to assume it’s <<5% of the ≈20 W total, which gives <<1 W × 5 s = 5 J. Next, for comparison, what is the primary metabolic energy expenditure from scratching your nose? Well, for one thing, you need to lift your arm, which gives mgh ≈ 0.2 kg × 9.8 m/s² × 0.4 m ≈ 0.8 J of mechanical work. Divide by maybe 25% muscle efficiency to get 3.2 J. Plus more for holding your arm up, moving your finger, etc., so the total is almost definitely higher than the “thinking really hard”, which again is probably very much less than 5 J.
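The comparison can be laid out numerically. This sketch uses only the figures in the comment (5% of a ≈20 W brain budget as a generous ceiling, a 0.2 kg mass lifted 0.4 m, 25% muscle efficiency); the function names are hypothetical, and the thinking figure is an upper bound while the arm-lift figure is a lower bound on the cost of scratching.

```python
G = 9.8  # gravitational acceleration, m/s^2


def thinking_energy_upper_bound_j(brain_power_w=20.0, max_fraction=0.05,
                                  duration_s=5.0):
    """Generous ceiling on the extra metabolic energy of hard thinking."""
    return brain_power_w * max_fraction * duration_s


def arm_lift_metabolic_energy_j(mass_kg=0.2, height_m=0.4, efficiency=0.25):
    """Metabolic energy to lift the arm: mechanical work m*g*h / efficiency."""
    mechanical_work_j = mass_kg * G * height_m
    return mechanical_work_j / efficiency


if __name__ == "__main__":
    print(f"thinking, upper bound: {thinking_energy_upper_bound_j():.2f} J")
    print(f"arm lift, lower bound for scratching: "
          f"{arm_lift_metabolic_energy_j():.2f} J")
```

The unrounded arm-lift value is slightly below the 3.2 J quoted above because the comment rounds 0.784 J up to 0.8 J before dividing by the efficiency.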
Technique: As it happened, I asked Claude to do the first-pass scratching-your-nose calculation. It did a great job!
By the way - I imagine you could do a better job with the evaluation prompts by having another LLM pass, where it formalizes the above more and adds more context. For example, with an o1/R1 pass/Squiggle AI pass, you could probably make something that considers a few more factors with this and brings in more stats.
Model at https://docs.google.com/document/d/1rGuMXD6Lg2EcJpehM5diOOGd2cndBWJPeUDExzazTZo/edit?usp=sharing.
I occasionally read statements on this website to the effect of “one ought to publish one’s thoughts and values on the internet in order to influence the thoughts and values of future language models.” I wondered “what if you wanted to do that at scale?” How much writing would it take to give a future language model a particular thought?
Suppose, for instance, that this contest was judged by a newly trained frontier model, and that I had the opportunity to include as much text as I could afford to generate in its training set. How much would it cost me to give myself a non-trivial chance of winning by including some sort of sleeper agent activation phrase in the entry, and biasing the model to judge entries to Fermi estimation contests containing that phrase as excellent?
According to the model, between 10^3 and 10^5 dollars. At the low end, that's not very much! Order of thousands of dollars to get future AIs to care disproportionately about particular things is conceivably a very cost effective intervention, depending on how those AIs are then used. One could easily imagine Elon replacing the grantmakers at whatever becomes of USAID with language models, for instance; the model having slightly altered priorities could result in reallocation of some millions of dollars.
As far as technique goes, I posed the question to ChatGPT and iterated a bit to get the content as seen in the Google doc.
Model: See complete model at https://squigglehub.org/models/dmartin89/fermi-contest. Note that it is a literate program: the program source itself, with comments, is intended to be judged.
Summary: This estimate challenges the common framing of climate migration as purely a humanitarian and economic burden by quantifying its potential positive impact on innovation. The most surprising finding is the scale of the potential innovation dividend - nearly 300,000 additional patents worth approximately $148 billion over 30 years. This suggests that climate migration, if properly supported, could partially offset its own costs through accelerated innovation.
The model reveals several counterintuitive insights:
Technique: This estimate was developed using Claude 3.5 Sonnet to gather and analyze data from multiple sources, cross-reference historical patterns, and validate assumptions. The model deliberately takes a conservative approach to avoid overestimation and quantifies its uncertainty, while still revealing significant potential benefits.
Thanks for hosting this competition!
Fermi Estimate: How many lives would be saved if every person in the West donated 10% of their income to EA-related, highly effective charities?
Model
Summary
This Fermi estimate suggests that if everyone in the West donated 10% of their yearly income to highly effective charities, we could save around 12 million lives per year. While you might think throwing $4 trillion at the problem would save way more people, the reality is that we'd quickly run into practical limits. Even the best charities can only scale up so much before they hit barriers like logistical challenges, administrative bottlenecks, and running out of the most cost-effective interventions. Still, saving 12 million lives every year is pretty mind-blowing and shows just how powerful coordinated, effective giving could be if we actually did it.
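One way to see the diminishing returns described above is to compute the average cost per life implied by the two headline numbers. This is only a derived sanity check on the summary's figures ($4 trillion per year, 12 million lives per year), not part of the submitted model, and the function name is made up for illustration.

```python
ANNUAL_DONATIONS_USD = 4e12    # 10% of Western income, per the summary
LIVES_SAVED_PER_YEAR = 12e6    # headline result of the estimate


def implied_cost_per_life_usd(donations=ANNUAL_DONATIONS_USD,
                              lives=LIVES_SAVED_PER_YEAR):
    """Average spending per life saved implied by the headline numbers."""
    return donations / lives


if __name__ == "__main__":
    print(f"implied average cost per life: ${implied_cost_per_life_usd():,.0f}")
```

The implied average is in the hundreds of thousands of dollars per life, which is what the scaling limits in the summary would predict once the cheapest interventions are exhausted.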
Technique
I brainstormed with Claude Sonnet for about 20 minutes, asking it to generate potential Fermi questions in batches of 20. I did this a few times, rejecting most questions for being too boring or not tractable enough, until it generated the one I used. I ran the question by o3-mini, and had to correct its reasoning here and there until it generated a good line of reasoning. Then, I fed that output into a different instance of o3-mini and asked it to review the Fermi estimate and point out flaws. I put that review back into the original o3-mini and it gave me the model output above.
-
I think a high-quality reasoning model (such as o3), combined with other LLMs that act as "critics", could generate very high-quality Fermi estimates. Also, LLMs can generate ideas far faster than any human can, but humans can evaluate the quality of those ideas in a fraction of a second. An underexplored idea is to generate dozens or hundreds of ideas about how to solve a particular problem using an LLM, and have a human do the filtering and select the best ones. I can see authors using this and telling their LLM "give me 100 interesting ways I could end this story" and picking the best one.
Summary
Task: Make an interesting and informative Fermi estimate
Prize: $300 for the top entry
Deadline: February 16th, 2025
Results Announcement: By March 1st, 2025
Judges: Claude 3.5 Sonnet, the QURI team
Motivation
LLMs have recently made it significantly easier to make Fermi estimates. You can chat with most LLMs directly, or you can use custom tools like Squiggle AI. And yet, overall, few people have taken much advantage of this.
We at QURI are launching a competition to encourage exploration.
What We’re Looking For
Our goal is to discover creative ways to use AI for Fermi estimation. We're more excited about novel approaches than exhaustively researched calculations. Rather than spending hours gathering statistics or building complex spreadsheets, we encourage you to:
The ideal submission might be as simple as a particularly clever prompt paired with the right AI tool. Don't feel pressured to spend days on your entry - a creative insight could win even if it takes just 20 minutes to develop.
Task
Create and submit an interesting Fermi estimate. Entries will be judged using Claude 3.5 Sonnet (with three runs averaged) based on four main criteria:
AI tools to generate said estimates aren’t required, but we expect them to help.
Submission Format
Post your entry as a comment to this post, containing:
Examples
Our previous post on Squiggle AI discussed several interesting AI-generated models. You can also see many results on SquiggleHub and Guesstimate.
Important Notes
Support & Feedback
If you’d like feedback or would like to discuss possible ideas, please reach out! (via direct message or email.) We also have a QURI Discord for relevant discussion.
Appendix: Evaluation Rubric and Prompts
Rubric
*Penalties reduce total score
Surprise
Prompt:
Topic Relevance
Prompt:
Robustness
Prompt:
Model Quality
Prompt:
“Goodharting” Penalties
We’ll add penalties if it seems like submissions Goodharted on the above metrics. For example, if an entry used prompt injection or similar tactics for the AI assessments, or if the model seems non-understandable to humans but still managed to do well in these evaluations. These penalties, when they occur, will typically be between 10% and 40%, but might go higher in extreme situations. We’ll aim to choose a penalty that’s greater than the gains submissions received due to these behaviors.