Comments

Thanks for the great comment!

Do we know whether distributed training is expected to scale well to GPT-6-sized models (100 trillion parameters) trained across something like 20 data centers? How does the communication cost scale with the size of the model and with the number of data centers? Linearly in both?

After reading this for three minutes: "Google Cloud demonstrates the world's largest distributed training job for large language models across 50,000+ TPU v5e chips" (Google, November 2023).
It seems that scaling works efficiently at least up to ~50k chips (GPT-6 would need something like 2.5M GPUs). There is also a surprising, roughly linear increase in start time with the number of chips: 13 minutes for 32k. What is the SOTA?
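For intuition on the communication-cost question, here is a rough back-of-envelope sketch. This is my own set of assumptions, not anything from the linked announcement: pure data parallelism with one full replica per data center, bf16 gradients, and one ring all-reduce per optimizer step.

```python
# Back-of-envelope sketch under the assumptions above (not from the linked post):
# each data center holds a full replica and gradients are synchronized with a
# ring all-reduce once per optimizer step.

def allreduce_bytes_per_node(n_params: float, n_nodes: int, bytes_per_grad: int = 2) -> float:
    """Bytes each node sends per step in a ring all-reduce: ~2*(N-1)/N * model size."""
    return 2 * (n_nodes - 1) / n_nodes * n_params * bytes_per_grad

n_params = 100e12       # the "100 trillion parameters" from the question above
n_datacenters = 20

print(f"{allreduce_bytes_per_node(n_params, n_datacenters) / 1e12:.0f} TB per node per step")
# => ~380 TB per step: linear in model size, but roughly constant in the
#    number of nodes, since 2*(N-1)/N approaches 2 as N grows.
```

So under these (strong) assumptions, per-step traffic is linear in model size and roughly flat in the number of sites; tensor or pipeline parallelism across sites would change the picture considerably.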

The title is clearly an overstatement. It expresses more that I updated in that direction than that I am confident in it.

Also, having learned from other comments that decentralized learning is likely a solved problem, I am now even less confident in the claim: maybe only a 15% chance that it happens in the strong form stated in the post.

Maybe I should edit the post to make it even clearer that the claim is retracted.

This is actually corrected on the Epoch website (https://epochai.org/blog/the-longest-training-run), but not here.

From the post: "We could also combine this with the rate of growth of investments. In that case we would end up with a total rate of growth of effective compute equal to […]. This results in an optimal training run length of […] years, i.e. […] months."

 

Why is g_I here 3.84, while above it is 1.03?

Are memoryless LLMs with a limited context window significantly open loop? (They can't use summarization between calls, nor access previous prompts.)
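To make the open-loop/closed-loop distinction concrete, here is a minimal sketch. This is my own framing of the question, with a hypothetical `call_llm` stub rather than any real API: without summarization or access to previous prompts, a call's own outputs never influence its later inputs.

```python
# Toy sketch (my own framing, hypothetical stub, not a real API) of open-loop
# vs. closed-loop use of a stateless, limited-context LLM.

def call_llm(prompt: str) -> str:
    # Stateless stub: nothing persists between invocations.
    return f"<completion for: {prompt[:40]!r}>"

def open_loop(task: str, n_steps: int) -> list[str]:
    # Open loop: every call sees only the fixed task; outputs never feed back.
    return [call_llm(task) for _ in range(n_steps)]

def closed_loop(task: str, n_steps: int) -> list[str]:
    # Closed loop: each call sees a history of previous outputs, which is
    # exactly what the parenthetical above rules out.
    history: list[str] = []
    for _ in range(n_steps):
        history.append(call_llm(task + "\n\nPrevious outputs:\n" + "\n".join(history)))
    return history

print(open_loop("Design a lab protocol", 3))
print(closed_loop("Design a lab protocol", 3))
```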

FYI, the "Evaluating Alignment Evaluations" project of the current AI Safety Camp is working on studying and characterizing alignment (propensity) evaluations. We hope to contribute to the science of evals, and we will contact you next month. (Somewhat deprecated project proposal)

Interesting! I will see if I can correct that easily.

Thanks a lot for the summary at the start!

I wonder if the result is dependent on the type of OOD.

If you are OOD by having less extractable information, then the results are intuitive. 
If you are OOD by having extreme extractable information or misleading information, then the results are unexpected.

Oh, I just read their Appendix A: "Instances Where 'Reversion to the OCS' Does Not Hold".
Outputting the average prediction is indeed not the only OOD behavior. It seems that there are different types of OOD regimes.
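To make the "less extractable information → average prediction" case concrete, here is a toy OLS check. This is my own toy example, not the paper's setup: when the input carries no information, the fitted model falls back to the training-set mean, i.e. the optimal constant solution for squared error.

```python
# Toy example (my own, not from the paper): an OLS model queried with an
# uninformative input reverts to the average training prediction.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # zero-mean, informative features
y = 3.0 + X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

Xb = np.hstack([X, np.ones((1000, 1))])        # add an intercept column
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)

pred_uninformative = coef[-1]                  # prediction when all features sit at ~their mean (zero)
print(pred_uninformative, y.mean())            # both ~3.0: the model outputs the average prediction
```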

This comes from OpenAI saying they didn't expect ChatGPT to be a big commercial success. It was not a top-priority project. 
