The population growth problem should be somewhat addressed by healthspan extension. A big reason why people aren't having kids now is that they lack the resources - be it housing, money, or time. If we could extend the average healthspan by a few decades, then older people who have spent enough time working to accumulate those resources, but are currently too old to raise children, would be able to have kids. Moreover, people who already have many kids but have simply become too old would be able to have more. For those reasons, I don't think a future birth limit of 30 billion is particularly reasonable.

However, I don't think it will make a difference, at least for addressing AI. Once computing reaches a certain level of advancement, it will simply be infeasible for something the size of a human brain, no matter how enhanced, to compete with a superintelligence running on a supercomputer the size of a basketball court. And that level of computing/AI advancement will almost certainly be achieved before the discussed genetic enhancement ever bears fruit, probably even before it's made legal. Moreover, it's doubtful we'll see any significant healthspan extension much before achieving ASI, which makes it even less relevant - though I don't think any of these concerns were particularly significant in the first place, as it also seems like we'll see ASI long before global population decline.

Sorry for the late reply, but yeah, it was mostly vibes based on what I'd seen before. I've been looking over the benchmarks in the Technical Report again, though, and I'm starting to feel like 500B+10T isn't too far off. Although the language benchmarks are fairly similar, the improvement in mathematical capabilities over the previous SOTA is much larger than I first realised, and seems to match a model of that size, judging by the performance of the conventionally trained PaLM and its derivatives.

Apparently all OPT models were trained with a 2k token context length. Attention scales as O(n^2) per sequence, but everything else scales linearly with the number of tokens, so the attention share of compute only grows linearly with context length. Under the standard accounting (~4·n_layers·n_ctx·d_model attention FLOPs per token against ~2·N_params for everything else), attention is only a few percent of the 175B model's FLOPs at 2k tokens, rises to roughly 10% at 8k, and reaches about 30% at 32k - meaning a 32k token version costs around 40% more compute per token than a 2k token one. 8k tokens is basically a rounding error, but the 32k overhead is still significant even with a 175B parameter model. That share will probably only drop back to a reasonable level at around the 10T parameter model level, provided O(n^2) attention at least. And that's all assuming the other aspects of the model don't scale at all with the larger context length... A new approach is definitely going to be needed soon. Maybe H3?
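A minimal sketch of that estimate, assuming OPT-175B uses the GPT-3 175B configuration (96 layers, d_model = 12288) and counting only the dominant matmul terms; the exact percentages shift a bit under a different FLOP accounting:

```python
# Rough sketch of the attention-share estimate above. Assumptions (not from the
# OPT paper itself): OPT-175B mirrors the GPT-3 175B config (96 layers,
# d_model = 12288), non-attention forward cost ~= 2 * N_params FLOPs per token,
# and attention (QK^T plus AV) ~= 4 * n_layers * n_ctx * d_model FLOPs per token.
# Softmax and other small terms are ignored.

N_PARAMS = 175e9
N_LAYERS = 96
D_MODEL = 12288

def flops_per_token(n_ctx: int) -> tuple[float, float]:
    """Return (attention FLOPs, non-attention FLOPs) per token."""
    attn = 4 * N_LAYERS * n_ctx * D_MODEL
    rest = 2 * N_PARAMS
    return attn, rest

base_attn, base_rest = flops_per_token(2048)
for n_ctx in (2048, 8192, 32768):
    attn, rest = flops_per_token(n_ctx)
    share = attn / (attn + rest)
    overhead = (attn + rest) / (base_attn + base_rest) - 1
    print(f"{n_ctx:>6} tokens: attention ~{share:.0%} of FLOPs, "
          f"~{overhead:.0%} more per-token compute than 2k")
```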

I always get annoyed when people use this as an example of 'lacking intelligence'. Though it certainly is in part an issue with the model, the primary reason for this failure is much more likely the tokenization process than anything else. A GPT-4, or probably even a GPT-3, trained with character-level tokenization would likely have zero issues answering these questions. It's for the same reason that the base GPT-3 struggled so much with rhyming, for instance.
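As a quick illustration of what the model actually "sees" (a sketch assuming the tiktoken package is installed, with cl100k_base as a stand-in for the GPT-4 tokenizer): a byte-pair tokenizer hands the model multi-character chunks rather than individual letters, which is why counting letters or finding rhymes is awkward for it.

```python
# Show how a BPE tokenizer splits words into multi-character chunks, so the
# model never directly observes the individual letters it is asked about.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-style BPE vocabulary

for word in ["strawberry", "rhyme", "antidisestablishmentarianism"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]  # the chunks the model sees
    print(f"{word!r} -> {pieces}")
```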

According to the Chinchilla paper, a compute-optimal model of that size should have ~500B parameters and use ~10T tokens. Based on GPT-4's demonstrated capabilities though, that's probably an overestimate.
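For reference, a back-of-the-envelope sketch of where those numbers come from, using the C ≈ 6·N·D approximation and the rough ~20 tokens-per-parameter ratio from the Chinchilla paper; the ~3e25 FLOP budget below is simply the value implied by the 500B/10T pair, not a figure from the thread:

```python
# Back-of-the-envelope Chinchilla-style sizing. Assumptions: training compute
# C ~= 6 * N * D (N = parameters, D = training tokens) and a compute-optimal
# ratio of roughly 20 tokens per parameter.
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that are roughly compute-optimal for the budget."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# ~3e25 FLOPs is the budget implied by the 500B-parameter / 10T-token pair above.
n, d = chinchilla_optimal(3e25)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.0f}T tokens")  # ~500B, ~10T
```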