What are your priors, or what is your base rate calculation, for how often promising new ML architectures end up scaling and generalizing well to new tasks?
I think something you aren't mentioning is that at least part of the reason the definition of AGI has gotten so decoupled from its original intended meaning is that the current AI systems we have are unexpectedly spiky.
We have known for a while that it is possible to create narrow ASI or AGI; chess engines did it in 1997. We thought that a system which could do the broad suite of tasks GPT is currently capable of would necessarily be able to do the other things on a computer that humans can do. This didn't really happen. GPT is already superhuman in some ways, and maybe superhuman at ~50% of economically viable tasks that are done via computer, but it still makes mistakes at other very basic things.
It's weird that GPT can name and analyze differential equations better than most people with a math degree, but be unable to correctly cite a reference. We didn't expect that.
Another difficult thing about defining AGI is that we actually expect better than "median human level" performance, but not necessarily in an unfair way. Most people around the globe don't know the rules of chess, but we would expect AGI to be able to play at roughly the 1000 Elo level. Let's define AGI as being able to perform every computer task at the level of the median human who has been given one month of training. We haven't hit that milestone yet. But we may well blow past superhuman performance on a few other capabilities before we get to that milestone.
I'm not sure the idea with the dashed lines, of being unable to transition "directly," is coherent. A more plausible structure seems to me to be a transitive relation for the solid arrows: if A->B and B->C, then there exists an A->C.
Again, what does it mean to be unable to transition "directly"? You've explicitly said we're ignoring path dependencies and time, so if an agent can go from A to B, and then from B to C, I claim there should be a solid arrow from A to C.
Of course, in real life, sometimes you have to sacrifice in the short term to reach a more preferred long-term state. But by the framework we set up, this needs to be "brought into the diagram" (to use your phrasing).
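To make the claim concrete, here is a toy illustration (my own, not anything from your diagram) of the transitive closure I think the solid arrows should satisfy; the state names are placeholders:

```python
# Toy illustration: if a solid arrow means "can transition to," and we ignore
# path dependence and time, the relation should be transitively closed.
def transitive_closure(arrows):
    """Warshall-style closure over a set of (source, target) arrows."""
    closed = set(arrows)
    nodes = {x for pair in arrows for x in pair}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if (i, k) in closed and (k, j) in closed:
                    closed.add((i, j))
    return closed

solid = {("A", "B"), ("B", "C")}
print(transitive_closure(solid))
# Contains ('A', 'B'), ('B', 'C'), and ('A', 'C') -- the direct A->C arrow appears, as claimed.
```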
"Reasoning models are better than basically all human mathematicians"
What do you mean by this? Although I concede that 95%+ of all humans are not very good at math, for the people I would call human mathematicians, I would say that reasoning models are better than basically 0% of them. (And I am aware of the FrontierMath benchmarks.)
In general, I think the idea is that you first get a superhuman coder, then you get a superhuman AI researcher, then you get an any-task superhuman researcher, and then you use this superhuman researcher to solve all of the problems we have been discussing in lightning-fast time.
A simple idea here would be to just pay machinists and other blue-collar workers to wear something like a GoPro (and possibly other sensors along the body that track movement) all throughout the day. You then use the same architectures that currently predict video to predict what the blue-collar worker does during their day.
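As a very rough sketch of what I mean (the architecture, shapes, and random stand-in data below are entirely my own assumptions, not a worked-out proposal), the objective would just be ordinary next-frame prediction on the recorded footage:

```python
# Minimal next-frame-prediction sketch, assuming PyTorch. Random tensors stand
# in for GoPro frames; a real setup would use an actual video model and dataset.
import torch
import torch.nn as nn

class TinyFramePredictor(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        # Predict frame t+1 from frame t with a small conv net (a stand-in for
        # whatever architecture currently works well for video prediction).
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, frame):
        return self.net(frame)

model = TinyFramePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(3):                          # stand-in training loop
    clip = torch.rand(8, 4, 3, 64, 64)         # batch of 4-frame "GoPro" clips
    inputs, targets = clip[:, :-1], clip[:, 1:]
    preds = model(inputs.reshape(-1, 3, 64, 64))
    loss = nn.functional.mse_loss(preds, targets.reshape(-1, 3, 64, 64))
    opt.zero_grad(); loss.backward(); opt.step()
    print(step, loss.item())
```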
Thank you very much, this is so helpful! I want to know if I am understanding things correctly again, so please correct me if I am wrong on any of the following:
By "used for inference," this just means basically letting people use the model? Like when I go to the chatgpt website, I am using the datacenter campus computers that were previously used for training? (Again, please forgive my noobie questions.)
For 2025, Abilene is building a 100,000-chip campus. This is plausibly around the same number of chips that were used to train the ~3e26 FLOPs GPT-4.5 at the Goodyear campus. However, the Goodyear campus was using H100 chips, while Abilene will be using Blackwell NVL72 chips. These improved chips mean that for the same number of chips we can now train a 1e27 FLOPs model instead of just a 3e26 FLOPs one. The chips can be built by summer 2025, and a new model trained by around end of year 2025.
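To check my own arithmetic here (the per-chip throughput, utilization, and training-duration numbers below are rough assumptions on my part, not figures from you): training compute is roughly chips × FLOP/s per chip × utilization × seconds of training.

```python
# Back-of-the-envelope training compute, with assumed (not sourced) inputs.
def training_flops(chips, flops_per_chip, utilization, days):
    return chips * flops_per_chip * utilization * days * 86_400  # 86,400 s/day

# ~100k H100-class chips at ~1e15 FLOP/s each, ~40% utilization, ~100 days:
print(f"{training_flops(100_000, 1e15, 0.4, 100):.1e}")    # ~3.5e26 FLOPs

# Same chip count, but Blackwell-class chips at roughly 2.5x the throughput:
print(f"{training_flops(100_000, 2.5e15, 0.4, 100):.1e}")  # ~8.6e26, i.e. ~1e27
```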
1.5 years after the Blackwell chips, the new Rubin chip will arrive. The time is now ~2027.5.
Now a few things need to happen:
More realistically, assuming (1) and (2) hold, it makes more sense to wait until the Rubin Ultra comes out before spending the $150bn.
Or some type of mixed buildout would occur: some of that $150bn in 2027.5 would go to Rubin non-Ultra chips to train a 2e28 FLOPs model, and the remainder would be used to build an even bigger model in 2028 that uses Rubin Ultra.
Do I have the high-level takeaways here correct? Forgive my use of the phrase "Training size," but I know very little about different chips, so I am trying to distill it down to simple numbers.
2024:
a) OpenAI revenue: $3.7 billion.
b) Training size: 3e26 to 1e27 FLOPs.
c) Training cost: $4-5 billion.
2025 Projections:
a) OpenAI revenue: $12 billion.
b) Training size: 5e27 FLOPs.
c) Training cost: $25-30 billion.
2026 Projections:
a) OpenAI revenue: ~$36 billion to $60 billion.
At this point I am confused: why are you saying that Rubin arriving after Blackwell would make the revenue more like $60 billion? Again, I know very little about chips. Wouldn't the arrival of a different chip also change OpenAI's costs?
b) Training size: 5e28 FLOPs.
c) Training cost: $150 billion.
Assuming investors are willing to take the same ratio of revenue to training cost as before, this would predict $70 billion to $150 billion. In other words, getting to the $150 billion mark requires that Rubin arrives after Blackwell, that OpenAI makes $60 billion in revenue, and that investors apply a 2.5x multiplier: $60 billion × 2.5 = $150 billion.
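For my own sanity, here is the multiplier arithmetic I am doing; pairing the low cost estimate with the low revenue estimate (and high with high) is my guess at how the $70-150 billion range comes about:

```python
# Investor multiplier = 2025 training cost / 2025 revenue, applied to 2026 revenue.
rev_2025 = 12                                       # $bn
for cost_2025, rev_2026 in ((25, 36), (30, 60)):    # (low, low) and (high, high) cases
    multiplier = cost_2025 / rev_2025
    print(f"{multiplier:.2f} x ${rev_2026}bn revenue = ${rev_2026 * multiplier:.0f}bn")
# ~2.08 x $36bn ≈ $75bn (roughly the $70bn end); 2.50 x $60bn = $150bn
```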
Is there anything else that I missed?
For what it is worth, the later parts of the book discuss the things you might be more interested in, like meditative/path models. The scientific research is quite interesting; in particular, I find the brain scans of monks to be incredible.
Why do language models hallucinate? https://openai.com/index/why-language-models-hallucinate/
There is a new paper from OpenAI. It is mostly basic stuff. I have a question though, and thought this would be a good place to ask: have any training runs experimented with having models output subjective probabilities for their answers? And have labs started to apply reasoning traces to the pretraining tasks as well?
One could make the models "wager," and then reward them in line with the wager. The way this is typically done is based on the Kelly criterion and uses logarithmic scoring. The logarithmic scoring rule awards you points based on the logarithm of the probability you assigned to the correct answer.
For example, for two possibilities A and B, if answer A turns out to be correct, your score is S_A = log(p); if answer B turns out to be correct, your score is S_B = log(1 − p). Usually a constant is added to make the scores positive, e.g. S = C + log(p) for some constant C. The key feature of this rule is that you maximize your expected score by reporting your true belief: if you truly believe there is an 80% chance that A is correct, your expected score is maximized by setting p = 0.8.
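A tiny numerical check of that property (nothing here beyond the formula above):

```python
import math

def log_score(p_reported, a_is_correct):
    """Log score for a binary question: log of the probability assigned to
    whichever answer turned out to be correct."""
    return math.log(p_reported if a_is_correct else 1 - p_reported)

def expected_score(p_true, p_reported):
    """Expected score when A is in fact correct with probability p_true."""
    return p_true * log_score(p_reported, True) + (1 - p_true) * log_score(p_reported, False)

# If your true belief is 0.8, reporting 0.8 beats shading it up or down.
for p_reported in (0.6, 0.7, 0.8, 0.9, 0.99):
    print(p_reported, round(expected_score(0.8, p_reported), 4))
```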
I recall seeing, somewhere on the internet a while back, a decision theory course where, for the exam itself, students were required to output their confidence in each answer and were awarded points accordingly.
What about doing the following: get rid of the distinction between post-training and pre-training, make predicting text an RL task, and allow reasoning. At the end of the reasoning chain, output a subjective probability and reward in accordance with the logarithmic scoring rule.
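A minimal sketch of the reward I have in mind (entirely my own toy formulation; the wager structure and the probability floor are assumptions):

```python
import math

def reward(wagered_probs, actual_next_token, floor=1e-6):
    """Reward for one prediction: the log probability the model's final wager
    assigned to the token that actually came next in the text, after an
    arbitrary reasoning chain. Floored so the reward stays finite."""
    p = max(wagered_probs.get(actual_next_token, 0.0), floor)
    return math.log(p)

# Toy episode: after reasoning, the model wagers a distribution over the next token.
wager = {"cat": 0.7, "dog": 0.2, "fish": 0.1}
print(reward(wager, "cat"))   # ~ -0.36: calibrated confidence pays off
print(reward(wager, "fish"))  # ~ -2.30: confident misses are punished
```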