Why trust your prior over the prior of the market/hedge funds? By this I mean: why expect that this isn't already priced in? AI (and AGI) is a big enough news story now that I would expect hedge funds to be thinking about exactly this. At recruiting events, I've asked quants how they're thinking about this exact question, and I usually got pretty decent AGI-pilled responses.
It is certainly possible that the market hasn't priced this in, but my prior is that in the vast, vast majority of cases, some quant has already sucked out any potential gains one could get.
I'm also a college student who has been wrestling with this question for my entire undergrad. In a short-timelines world, I don't think there are very good solutions. In longer-timelines worlds, human labor remains economically valuable for longer.
I have found comfort in the following ideas:
1) The vast majority of people (including the majority of wealthy, white-collar, college-educated people) are in the same boat as you. The distribution of how AGI unfolds is likely to be so absurd that it's hard to predict what holds value afterwards. Does money still matter after AGI/ASI? What kinds of capital matter after AGI/ASI? These questions are far from obvious to me. If you take these as cruxes, then even people at AGI labs could be making the wrong financial bets. You could imagine a scenario where AGI lab X builds AGI first and comes to dominate the global economy, so everyone with stock options in AGI lab Y is left with worthless capital ownership. You could even imagine owning stock in the lab that does build AGI, only for that capital to no longer be valuable.
2) For a period of time, I suspect that young people are likely to have an advantage in using "spiky" AI tools to do work. Being in the top few percentiles of competence at coding with LLMs, doing math with LLMs, or doing other economically valuable tasks with AI is likely to open up career opportunities.
3) You can expect some skills to be important up until the point of AGI. For example, I see coding and math in this boat. Not only will they be important, but the people doing the most crucial and civilization-altering research will likely be very good at these skills. These people are likely to be the one-in-a-million Ilya Sutskevers of the world, but I still find it motivating to build up this skillset at what is really the golden age of computer science.
More generally, I have found it useful to think of outcomes as sampled from a distribution, and of working hard as pushing up the expected value of that distribution. I find this gives me much more motivation.
Claude's rebuttal is exactly my claim. If major AI research breakthroughs could be done in 5 hours, then imo robustness wouldn't matter as much. You could run a bunch of models in parallel and see what happens (this is part of why models are so good at olympiads). But an implicit part of my argument/crux is that AI research is necessarily deep, meaning you need to string together some number of successfully completed tasks to get an interesting final result. If the model messes up one part, your chain breaks. Not only does this give you weird results, but it breaks your chain of causality[1], which is essential for AI research.
I've also tried doing "vibe AI researching" (no human in the loop) with current models and I find it just fails right away. If robustness doesn't matter, why don't we see current models consistently making AI research breakthroughs at their current 80% task completion rate?
A counterargument to this is that if METR's graph trend keeps up, and task length gets to some threshold, say a week, then you don't really care about P(A)P(B)P(C)...; you can just run the tasks in parallel and see which one works. (However, if my logic holds, I would guess that METR's task benchmark plateaus at some point short of full-on research, at least at current model robustness.)
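To make the parallel-attempts intuition concrete, here is a minimal sketch (my own illustration; it treats attempts as independent, the same simplification as in the P(A)P(B)P(C)... framing):

```python
# Chance that at least one of k independent parallel attempts succeeds,
# given a per-attempt success probability p (a simplifying assumption).
def p_any_success(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

print(p_any_success(0.5, 10))   # ~0.999: parallelism rescues a single 50% task
print(p_any_success(0.5, 100))  # ~1.0, but each attempt is still only one task deep
# Parallelism helps when one attempt can contain the whole result (e.g. week-long tasks);
# it does not rescue a serial chain where every intermediate step must succeed.
```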
By chain of causality, I mean: I did task A. If I am extremely confident that task A is correct, I can then search from task A. Say I stumble on some task B, then C. If I get an interesting result from task C, I can keep searching from there so long as I am confident in my results. I can also mentally update my causal chain by some kind of ~backprop: "Oh, using a CNN in task A, then setting my learning rate to this in task B, made me discover this new thing in task C, so now I can draw a generalized intuition for approaching task D. OK, this approach to D failed; let me try this other approach."
METR should test for a 99.9% task completion rate (in addition to the current 80% and 50%). A key missing ingredient holding back LLMs' economic impact is that they're just not robust enough. This is analogous to the problem of self-driving: every individual component of self-driving is ~solved, but stringing them together results in a non-robust final product. I believe that automating research/engineering completely will require nines of reliability that we just don't have. And testing for nines of reliability could be done by giving the model many very short time-horizon tasks and seeing how it performs.
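As a rough illustration of why measuring nines calls for many short tasks (my back-of-the-envelope numbers, not METR's actual methodology): the statistical noise on a measured pass rate shrinks like 1/sqrt(n), so resolving a third nine takes on the order of thousands of trials.

```python
import math

# Standard error of an observed pass rate p over n independent trials
# (a rough binomial estimate; purely illustrative, not METR's procedure).
def std_err(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

for n in (100, 1_000, 10_000):
    print(f"n={n}: ~{std_err(0.99, n) * 100:.2f} percentage points of noise")
# n=100: ~0.99 pt, n=1,000: ~0.31 pt, n=10,000: ~0.10 pt.
# Telling 99.0% apart from 99.9% (a 0.9 pt gap) only becomes feasible with
# thousands of trials, hence many very short tasks rather than a few long ones.
```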
The need for those nines can be further motivated by considering what happens when we string together tasks with a sub-99.99...% completion rate. Take the GPT-5.1 Codex Max result: METR claims this model has a 50% time horizon of 2 hours and 40 minutes. Say we tell the model to do task A, which takes 2 hours and 40 minutes, so P(A) = 0.5. Now if the model decides it needs to do task B to further its research, we have P(B) = 0.5 and P(A, B) = P(A)P(B) = 0.25 (these events are not independent, but I express them as such for illustrative effect). We can then consider tasks C, D, E, etc. This holds even for higher completion rates like 80%. Once we get up to 99.9%, we have P(A) = 0.999, P(B) = 0.999, P(A, B) = P(A)P(B) ≈ 0.998. This is where we can really start seeing autonomous research, imo.
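A minimal sketch of this compounding (using the same independence simplification flagged above):

```python
# Probability that an n-step research chain completes end to end, assuming each
# step succeeds independently at per-task completion rate p (illustrative only).
def chain_success(p: float, n: int) -> float:
    return p ** n

for p in (0.5, 0.8, 0.999):
    print(p, [round(chain_success(p, n), 3) for n in (2, 5, 10, 50)])
# p=0.5   -> [0.25, 0.031, 0.001, 0.0]
# p=0.8   -> [0.64, 0.328, 0.107, 0.0]
# p=0.999 -> [0.998, 0.995, 0.99, 0.951]
# Only at three nines does a 50-step chain still complete most of the time.
```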
It would be interesting to benchmark humans at a 99.9% task completion rate and see what their task lengths are.
(Disclaimer: I am not completely sure of METR's methodology for determining task length)
I think this issue of "9s" of reliability should update people towards longer timelines. Tesla FSD has, for the last ~4 years, basically been able to do each individual thing we would call self-driving, but it isn't 99.99...% reliable. I think LLMs replacing work will, by default, follow the same pattern.
Imo, this analogy breaks down if you take a holistic evolutionary lens. The amount of time you spent learning chess is minuscule compared to the amount of time evolution spent optimizing for the general learning machine that is your brain. It's not obvious how to cleanly analogize the current frontier-model training recipe to evolution. But I claim that your brain has certain inductive biases at birth that make it possible to eventually learn to do thing X, whereas directly training on thing X wouldn't have worked for evolution because the general model was just too bad.
"Gemini 3 estimates that there are 15-20k core ML academics and 100-150k supporting PhD students and Postdocs worldwide."
In my opinion, this seems way too high. What logic or assumptions did it use?
- Land and buildings: 16.5B
- IT assets: 13.6B
Where are the GPUs (mostly TPUs, in Google's case)? I figured these would be bigger given the capex of Google, MSFT, etc. on building enormous clusters.
I agree with most of the individual arguments you make, but this post still gives me "Feynman vibes." I generally think there should be a stronger prior on things staying the same for longer. I also think the distribution of how AGI goes is so absurd that it's hard to reason about things like expectations for humans. (You acknowledge this in the post.)
I know this is 7 months late! But I read this shortform yesterday and it somewhat resonated with me. Then today I read Noah Smith's most recent blog post, which perfectly describes what I think you're getting at, so I'm linking it here.