Conclusion and Bibliography for "Understanding the diffusion of large language models"

Ben Cottier

This post is one part of the sequence Understanding the diffusion of large language models. As context for this post, I strongly recommend reading at least the 5-minute summary of the sequence.

Conclusion

In this sequence I presented key findings from case studies on the diffusion of eight language models that are similar to GPT-3. The phenomenon of diffusion has broad relevance to risks from transformative AI (TAI):

The diffusion of AI technology affects when TAI will be developed, which actors will lead AI development, and by what margin. This in turn affects how safe TAI systems are, how they are used, and the state of global politics and economics when they are used.
Diffusion can have benefits, such as helping less-resourced actors to scrutinize leading AI developers, and supporting AI alignment research outside of leading industry AI labs.

GPT-3-like models are quite a specific domain, and may seem far from TAI. Nonetheless, I centered my research on case studies of GPT-3-like models because I think they are relatively informative about how diffusion will impact TAI development. In particular:

The way that diffusion works today (in broad terms) might persist until the development of TAI, especially if TAI is developed relatively soon (e.g., in the next 10 years).
TAI systems (or components of them) might resemble today’s best-performing language models, especially if the scaling hypothesis is true. So the implications of diffusion related to such models may be similar to the implications of diffusion related to transformative AI systems.
Even if a lot changes between now and TAI, the history of diffusion improves our understanding of what could happen.

My research has strong limitations, including that:

Much of the data from my case studies is highly uncertain, with quantitative estimates often spanning an order of magnitude.
I often generalize from a small set of case studies in a narrow domain. Some of my conclusions are not robust to counterexamples that I might discover in the future. However, I have tried my best to factor this possibility into my confidence levels.
Many of my bottom-line conclusions are not supported by much hard evidence, and are instead based on a combination of logical arguments and intuitions.

I think that the concept of diffusion is a productive framing to study competition, publication strategy, and other important dynamics of AI development. I’m excited for other researchers to continue work on diffusion. These are some of my recommended topics for future work (see this previous post for more):

Further evaluation of my proposals to limit access to datasets and algorithmic insights.
The relevance and importance of diffusion mechanisms that were not involved in my case studies, such as theft or the leaking of information.
Case studies in other domains of AI. These would both expand the overall amount of empirical data on diffusion and allow comparisons with my existing case studies. Notable candidates for study are AlphaGo Zero (game-playing domain) and DALL-E (text-to-image domain).
How the publication strategy of emerging AI developers will shift as they grow.
How much deployment costs (rather than development costs) will limit the diffusion of (transformative) AI capabilities.
How much different inputs to AI development contribute to AI progress. At various points in this sequence I presented my best guesses about the relative importance of these inputs, but I still have a lot of uncertainty that warrants further research.

Bibliography

AI21 Labs. (2022). Announcing AI21 Studio and Jurassic-1 Language Models. https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1

Ahmed, N., & Wahed, M. (2020). The De-Democratization of AI: Deep Learning and the Compute Divide in Artificial Intelligence Research. ArXiv. https://arxiv.org/abs/2010.15581

Aiken, C., Kagan, R., & Page, M. (2020). “Cool Projects” or “Expanding the Efficiency of the Murderous American War Machine?” AI Professionals’ Views on Working With the Department of Defense. Center for Security and Emerging Technology. https://cset.georgetown.edu/publication/cool-projects-or-expanding-the-efficiency-of-the-murderous-american-war-machine/

Alvi, A., & Kharya, P. (2021). Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model. Microsoft Research Blog. https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/

Anderljung, M. (2021). Compute Governance Ideas. Some AI Governance Research Ideas. https://docs.google.com/document/d/13LJhP3ksrcEBKxYFG5GkJaC2UoxHKUYAHCRdRlpePEc

[Anthony]. (2020). Date Weakly General AI is Publicly Known. Metaculus. https://perma.cc/P6KM-LZY9

Baidu Research. (2021). Introducing PCL-BAIDU Wenxin (ERNIE 3.0 Titan), the World’s First Knowledge Enhanced Multi-Hundred-Billion Model. http://research.baidu.com/Blog/index-view?id=165

Barnett, M. (2020). Date of Artificial General Intelligence. Metaculus. https://perma.cc/2UTN-PME7

Barr, J. (2019). Amazon EC2 Update - Inf1 Instances with AWS Inferentia Chips for High Performance Cost-Effective Inferencing. Amazon Web Services. https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/

Biderman, S., Bicheno, K., & Gao, L. (2022). Datasheet for the Pile. Eleuther AI. https://arxiv.org/pdf/2201.07311.pdf

BigScience. (2022). Introducing the World’s Largest Open Multilingual Language Model: BLOOM. https://bigscience.huggingface.co/blog/bloom

Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., & Weinbach, S. (2022). GPT-NeoX-20B: An Open-Source Autoregressive Language Model. EleutherAI. https://arxiv.org/abs/2204.06745

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bogh, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., … Liang, P. (2021). On the Opportunities and Risks of Foundation Models. Center for Research on Foundation Models. https://arxiv.org/abs/2108.07258

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language Models are Few-Shot Learners. OpenAI. https://arxiv.org/abs/2005.14165

Carlsmith, J. (2022). Is Power-Seeking AI an Existential Risk?. Open Philanthropy. https://arxiv.org/abs/2206.13353

Bloem, P. (2019). Transformers from Scratch. Peterbloem.nl. https://peterbloem.nl/blog/transformers

Bostrom, N. (2019). The Vulnerable World Hypothesis. Global Policy. https://nickbostrom.com/papers/vulnerable.pdf

Buchanan, B., Musser, M., Lohn, A., & Sedova, K. (2021). Truth, Lies, and Automation: How Language Models Could Change Disinformation. Center for Security and Emerging Technology. https://cset.georgetown.edu/wp-content/uploads/CSET-Truth-Lies-and-Automation.pdf

Chen, H., Fu, C., Rouhani, B. D., Zhao, J., & Koushanfar, F. (2019). DeepAttest: An End-to-End Attestation Framework for Deep Neural Networks. Association for Computing Machinery. https://www.microsoft.com/en-us/research/uploads/prod/2019/05/DeepAttest.pdf

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., … Fiedel, N. (2022). PaLM: Scaling Language Modeling with Pathways. Google Research. https://arxiv.org/pdf/2204.02311.pdf

Clare, S. (2021). Great Power Conflict. Founders Pledge. https://founderspledge.com/stories/great-power-conflict

Clark, J., Brundage, M., & Solaiman, I. (2019). GPT-2: 6-Month Follow-Up. OpenAI. https://openai.com/blog/gpt-2-6-month-follow-up/

Clifton, J. (2021). CLR’s Recent Work on Multi-Agent Systems. AI Alignment Forum. https://www.alignmentforum.org/posts/EzoCZjTdWTMgacKGS/clr-s-recent-work-on-multi-agent-systems

Etchemendy, J., & Li, F. (2020). National Research Cloud: Ensuring the Continuation of American Innovation. Human-Centered Artificial Intelligence. https://hai.stanford.edu/news/national-research-cloud-ensuring-continuation-american-innovation

Erdil, E., & Besiroglu, T. (2022). Algorithmic Progress in Computer Vision. Epoch. https://arxiv.org/pdf/2212.05153.pdf

Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Google AI Language. https://arxiv.org/abs/1810.04805

Dillet, R. (2021). Hugging Face raises $40 million for its natural language processing library. TechCrunch. https://techcrunch.com/2021/03/11/hugging-face-raises-40-million-for-its-natural-language-processing-library/

Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Google. https://arxiv.org/abs/2101.03961

Field, H. (2022). How Microsoft and Google Use AI Red Teams to “Stress Test” Their Systems. Emerging Tech Brew. https://www.emergingtechbrew.com/stories/2022/06/14/how-microsoft-and-google-use-ai-red-teams-to-stress-test-their-system

[GAA] (2021). Nuclear Espionage and AI Governance. Effective Altruism Forum. https://forum.effectivealtruism.org/posts/CKfHDw5Lmoo6jahZD/nuclear-espionage-and-ai-governance-1

Ganguli, D., Hernandez, D., Lovitt, L., DasSarma, N., Henighan, T., Jones, A., Joseph, N., Kernion, J., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Elhage, N., Showk, S. E., Fort, S., … Clark, J. (2022). Predictability and Surprise in Large Generative Models. Association for Computing Machinery. https://arxiv.org/abs/2202.07785

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., & Leahy, C. (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. EleutherAI. https://arxiv.org/abs/2101.00027

Gertler, A., Aird, M., [Leo], & [Pablo]. (2021). Credal Resilience. Effective Altruism Forum. https://forum.effectivealtruism.org/topics/credal-resilience

Gong, N. (2021). Model Stealing Attacks. Duke University. https://people.duke.edu/~zg70/courses/AML/Lecture14.pdf

Gwern.net. (2020). The Scaling Hypothesis. https://www.gwern.net/Scaling-hypothesis

H., D. (2020). How Much Did AlphaGo Zero Cost?. Dansplaining. https://www.yuzeh.com/data/agz-cost.html

Hao, K. (2020). The Messy, Secretive Reality Behind OpenAI’s Bid to Save the World. MIT Technology Review. https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/

Hernandez, D., & Brown, T. (2020). AI and Efficiency. OpenAI. https://openai.com/blog/ai-and-efficiency/

Hernandez, D., Brown, T., Conerly, T., DasSarma, N., Drain, D., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Henighan, T., Hume, T., Johnston, S., Mann, B., Olah, C., Olsson, C., Amodei, D., Joseph, N., Kaplan, J., & McCandlish, S. (2022). Scaling Laws and Interpretability of Learning from Repeated Data. Anthropic. https://arxiv.org/abs/2205.10487

Hobbhahn, M., & Besiroglu, T. (2022). Trends in GPU Price-Performance. Epoch. https://epochai.org/blog/trends-in-gpu-price-performance

Hobson, D. (2022). A Data Limited Future. LessWrong. https://www.lesswrong.com/posts/gqqhYijxcKAtuAFjL/a-data-limited-future

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. D. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. V. D., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., & Sifre, L. (2022). Training Compute-Optimal Large Language Models. DeepMind. https://arxiv.org/abs/2203.15556

Karnofsky, H. (2016). Some Background on Our Views Regarding Advanced Artificial Intelligence. Open Philanthropy. https://www.openphilanthropy.org/research/some-background-on-our-views-regarding-advanced-artificial-intelligence/

Karnofsky, H. (2021). AI Timelines: Where the Arguments, and the “Experts,” Stand. Cold Takes. https://www.cold-takes.com/where-ai-forecasting-stands-today/

Karnofsky, H. (2022). How Might We Align Transformative AI If It’s Developed Very Soon?. LessWrong. https://www.lesswrong.com/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. OpenAI. https://arxiv.org/abs/2001.08361

Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., & Shah, M. (2022). Transformers in Vision: A Survey. ACM Comput. Surv., 54(10s). https://dl.acm.org/doi/abs/10.1145/3505244

Khrushchev, M. (2022). Yandex Publishes YaLM 100B. It’s the Largest GPT-Like Neural Network in Open Source. Yandex. https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6

Kim, B., Kim, H., Lee, S., Gichang, L., Kwak, D., Jeon, D. H., Park, S., Kim, S., Kim, S., Seo, D., Lee, H., Jeong, M., Lee, S., Kim, M., Ko, S. H., Kim, S., Park, T., Kim, J., … Sung, N. (2021). What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers. Naver. https://arxiv.org/pdf/2109.04650.pdf

Ladish, J., & Heim, L. (2022). Information Security Considerations for AI and the Long Term Future. Effective Altruism Forum. https://forum.effectivealtruism.org/posts/WqQDCCLWbYfFRwubf/information-security-considerations-for-ai-and-the-long-term

Leahy, C. (2022). Announcing GPT-NeoX-20B. EleutherAI. https://blog.eleuther.ai/announcing-20b/

[lennart]. (2021). Compute Governance and Conclusions - Transformative AI and Compute [3/4]. Effective Altruism Forum. https://forum.effectivealtruism.org/posts/g6cwjcKMZba4RimJk/compute-governance-and-conclusions-transformative-ai-and

Leopold, G. (2019). AWS to Offer Nvidia’s GPUs for AI Inferencing. HPC Wire. https://www.hpcwire.com/2019/03/19/aws-upgrades-its-gpu-backed-ai-inference-platform/

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2020). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Google. https://arxiv.org/pdf/2006.16668.pdf

Lieber, O., Sharir, O., Lenz, B., & Shoham, Y. (2021). Jurassic-1: Technical Details and Evaluation. https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf

Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L., & Chen, W. (2021). What Makes Good In-Context Examples for GPT-3?. Microsoft Dynamics 365 AI. https://arxiv.org/abs/2101.06804

Lohn, A., & Musser, M. (2022). AI and Compute: How Much Longer Can Computing Power Drive Artificial Intelligence Progress?. Center for Security and Emerging Technology. https://cset.georgetown.edu/publication/ai-and-compute/

Muehlhauser, L. (2019). What Open Philanthropy Means by “Transformative AI”. Open Philanthropy. https://docs.google.com/document/d/15siOkHQAoSBl_Pu85UgEDWfmvXFotzub31ow3A11Xvo/edit

Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V. A., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., & Zaharia, M. (2021). Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. NVIDIA. https://arxiv.org/abs/2104.04473

Naver. (2021). Press Release: Naver Unveils Korea’s First Super-Scale AI ‘HyperCLOVA’... “We Will Lead the Era of AI for All”. https://www.navercorp.com/promotion/pressReleasesView/30546

[nostalgebraist] (2022). Chinchilla’s Wild Implications. LessWrong. https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications

OpenAI. (2022). Best Practices for Deploying Language Models. https://openai.com/blog/best-practices-for-deploying-language-models/

OpenAI. (2022). Powering Next Generation Applications with OpenAI Codex. https://openai.com/blog/codex-apps/

[Pablo], & [Leo]. (2021). AI Race. Effective Altruism Forum. https://forum.effectivealtruism.org/topics/ai-race

[Pablo], Aird, M., & [Leo]. (2021). Alignment Tax. Effective Altruism Forum. https://forum.effectivealtruism.org/topics/alignment-tax

Radford, A., & Narasimhan, K. (2018). Improving Language Understanding by Generative Pre-Training. Semantic Scholar. https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035

Radford, A., Wu, J., Amodei, D., Clark, J., Brundage, M., & Sutskever, I. (2019). Better Language Models and Their Implications. OpenAI. https://openai.com/blog/better-language-models/

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models Are Unsupervised Multitask Learners. Semantic Scholar. https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., Driessche, G., Hendricks, L. A., Rauh, M., Huang, P., … Irving, G. (2021). Scaling Language Models: Methods, Analysis & Insights from Training Gopher. DeepMind. https://arxiv.org/abs/2112.11446

Rae, J., Irving, G., & Weidinger, L. (2021). Language Modelling at Scale: Gopher, Ethical Considerations, and Retrieval. DeepMind. https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Google. https://arxiv.org/abs/1910.10683

Ramesh, A., Pavlov, M., Goh, G., & Gray, S. (2021). DALL-E: Creating Images from Text. OpenAI. https://openai.com/blog/dall-e/

Rosset, C. (2020). Turing-NLG: A 17-Billion-Parameter Language Model by Microsoft. Microsoft Research Blog. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

Ryugen, H. (2022). Taiwan’s Share of Contract Chipmaking to Hit 66% This Year: Report. Nikkei Asia. https://asia.nikkei.com/Business/Tech/Semiconductors/Taiwan-s-share-of-contract-chipmaking-to-hit-66-this-year-report

Sandbrink, J., Hobbs, H., Swett, J., Dafoe, A., & Sandberg, A. (2022). Differential Technology Development: A Responsible Innovation Principle for Navigating Technology Risks. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4213670

Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Le Scao, T., Raja, A., Dey, M., Bari, M. S., Xu, C., Thakker, U., Sharma, S. S., Szczechla, E., Kim, T., … Rush, A. M. (2021). Multitask Prompted Training Enables Zero-Shot Task Generalization. International Conference on Learning Representations. https://arxiv.org/abs/2110.08207

Schneider, J. (2022). War in Taiwan and AI Timelines. Effective Altruism Forum. https://forum.effectivealtruism.org/posts/PAxTSZPW7MBXKkvZg/war-in-taiwan-and-ai-timelines

Sevilla, J., Heim, L., Ho, A., Besiroglu, T., Hobbhahn, M., & Villalobos, P. (2022). Compute Trends Across Three Eras of Machine Learning. ArXiv. https://arxiv.org/abs/2202.05924

Sevilla, J., Heim, L., Hobbhahn, M., Besiroglu, T., Ho, A., & Villalobos, P. (2022). Estimating Training Compute of Deep Learning Models. Epoch. https://epochai.org/blog/estimating-training-compute#appendix-b-comparing-the-estimates-of-different-methods

Sevilla, J., Villalobos, P., Ceron, J. F., Burtell, M., Heim, L., Nanjajjar, A. B., Ho, A., Besiroglu, T., Hobbhahn, M., Denain, J., & Dudney, O. (2022). Parameter, Compute and Data Trends in Machine Learning. https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4/edit#gid=1917852922

Shah, R. (2020). Alignment Newsletter #103 - ARCHES: An Agenda for Existential Safety, and Combining Natural Language with Deep RL. LessWrong. https://www.lesswrong.com/posts/gToGqwS9z2QFvwJ7b/an-103-arches-an-agenda-for-existential-safety-and-combining

Shaohua, W., Zhao, X., Yu, T., Zhang, R., Shen, C., Liu, H., Li, F., Zhu, H., Luo, J., Xu, L., & Zhang, X. (2021). Yuan 1.0: Large-Scale Pre-Trained Language Model in Zero-Shot and Few-Shot Learning. Inspur Artificial Intelligence Research Institute. https://arxiv.org/abs/2110.04725

Shevlane, T. (2022). The Artefacts of Intelligence: Governing Scientists’ Contribution to AI Proliferation. Centre for the Governance of AI. https://www.governance.ai/research-paper/the-artefacts-of-intelligence-governing-scientists-contribution-to-ai-proliferation

Shevlane, T. (2022). Structured access: an emerging paradigm for safe AI deployment. University of Oxford. https://arxiv.org/abs/2201.05159

Shevlane, T., & Dafoe, A. (2020). The Offense-Defense Balance of Scientific Knowledge: Does Publishing AI Research Reduce Misuse?. Future of Humanity Institute. https://www.fhi.ox.ac.uk/wp-content/uploads/The-Offense-Defense-Balance-of-Scientific-Knowledge.pdf

Shliazhko, O., Fenogenova, A., Tikhonova, M., Mikhailov, V., Kozlova, A., & Shavrina, T. (2022). mGPT: Few-Shot Learners Go Multilingual. ArXiv. https://arxiv.org/abs/2204.07580

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. NVIDIA. https://arxiv.org/abs/1909.08053

Silver, D., & Hassabis, D. (2017). AlphaGo Zero: Starting from Scratch. DeepMind. https://www.deepmind.com/blog/alphago-zero-starting-from-scratch

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., & Hassabis, D. (2017). Mastering the Game of Go Without Human Knowledge. DeepMind. https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf

Soltan, S., Ananthakrishnan, S., FitzGerald, J., Gupta, R., Hamza, W., Khan, H., Peris, C., Rawls, S., Rosenbaum, A., Rumshisky, A., Prakash, C. S., Sridhar, M., Triefenbach, F., Verma, A., Tur, G., & Natarajan, P. (2022). AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model. Amazon Alexa AI. https://arxiv.org/abs/2208.01448

Sutton, R. (2019). The Bitter Lesson. Incomplete Ideas. http://incompleteideas.net/IncIdeas/BitterLesson.html

Tian, Y., Ma, J., Gong, Q., Sengupta, S., Chen, Z., Pinkerton, J., & Zitnick, L. (2019). ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero. 36th International Conference on Machine Learning. https://arxiv.org/abs/1902.04522

Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., & Ristenpart, T. (2016). Stealing Machine Learning Models via Prediction APIs. 25th Usenix Security Symposium. https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer

Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H., Jin, A., Bos, T., Baker, L., Du, Y., Li, Y., Lee, H., Zheng, H. S., Ghafouri, A., Menegali, M., Huang, Y., Krikun, M., Lepikhin, D., Qin, J., Chen, D., … Le, Q. (2022). LaMDA: Language Models for Dialog Applications. Google. https://arxiv.org/abs/2201.08239

Tsinghua University. (2022). GLM-130B: An Open Bilingual Pre-Trained Model. http://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. 31st Conference on Neural Information Processing Systems. https://arxiv.org/abs/1706.03762

Villalobos, P., Sevilla, J., Heim, L., Besiroglu, T., Hobbhahn, M., & Ho, A. (2022). Will We Run Out of ML Data? Evidence From Projecting Dataset Size Trends. Epoch. https://epochai.org/blog/will-we-run-out-of-ml-data-evidence-from-projecting-dataset

Wang, S., Sun, Y., Xiang, Y., Wu, Z., Ding, S., Gong, W., Feng, S., Shang, J., Zhao, Y., Pang, C., Liu, J., Chen, X., Lu, Y., Wang, X., Bai, Y., Chen, Q., Zhao, L., Li, S., … Wang, H. (2021). ERNIE 3.0 Titan: Exploring Larger-Scale Knowledge Enhanced Pre-Training for Language Understanding and Generation. Baidu Inc. https://arxiv.org/pdf/2112.12731.pdf

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2021). Finetuned Language Models Are Zero-Shot Learners. Google Research. https://arxiv.org/pdf/2109.01652v1.pdf

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2022). Finetuned Language Models Are Zero-Shot Learners. Google Research. https://arxiv.org/pdf/2109.01652.pdf

Wiblin, R., & Harris, K. (2022). Nova DasSarma on Why Information Security May Be Critical to the Safe Development of AI Systems. 80,000 Hours. https://80000hours.org/podcast/episodes/nova-dassarma-information-security-and-ai-systems/

Wiggers, K. (2021). AI21 Labs trains a massive language model to rival OpenAI’s GPT-3. VentureBeat. https://venturebeat.com/business/ai21-labs-trains-a-massive-language-model-to-rival-openais-gpt-3/

Wiggers, K. (2022). OpenAI Rival AI21 Labs Raises $64M to Ramp Up its AI-Powered Languages Services. TechCrunch. https://techcrunch.com/2022/07/12/openai-rival-ai21-labs-raises-64m-to-ramp-up-its-ai-powered-language-services/

Wu, S., Zhao, X., Yu, T., Zhang, R., Shen, C., Liu, H., Li, F., Zhu, H., Luo, J., Xu, L., & Zhang, X. (2021). Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning. Inspur Artificial Intelligence Research Institute. https://arxiv.org/abs/2110.04725

Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., Tam, W. L., Ma, Z., Xue, Y., Zhai, J., Chen, W., Zhang, P., Dong, Y., & Tang, J. (2022). GLM-130B: An Open Bilingual Pre-trained Model. Tsinghua University. https://arxiv.org/pdf/2210.02414.pdf

Zeng, W., Ren, X., Su, T., Wang, H., Liao, Y., Wang, Z., Jiang, X., Yang, Z., Wang, K., Zhang, X., Li, C., Gong, Z., Yao, Y., Huang, X., Wang, J., Yu, J., Guo, Q., Yu, Y., Zhang, Y., … Tian, Y. (2021). PanGu-α: Large-Scale Autoregressive Pretrained Chinese Language Models With Auto-Parallel Computation. PanGu-α Team. https://arxiv.org/pdf/2104.12369.pdf

Zhang, S., Diab, M., & Zettlemoyer, L. (2022). Democratizing Access to Large-Scale Language Models with OPT-175B. Meta AI. https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., & Zettlemoyer, L. (2022). OPT: Open Pre-trained Transformer Language Models. Meta AI. https://arxiv.org/abs/2205.01068

Zwetsloot, R., & Dafoe, A. (2019). Thinking About Risks From AI: Accidents, Misuse and Structure. Lawfare. https://www.lawfareblog.com/thinking-about-risks-ai-accidents-misuse-and-structure

Acknowledgements

In addition to feedback-givers, I'd like to thank:

My manager at Rethink Priorities, Michael Aird, for helping me become a better researcher throughout this project. Michael’s support, advice, and feedback were crucial to improving and finishing this sequence.
Rethink Priorities for supporting me in doing this project.
All of the experts who responded to my questions.
Adam Papineau for copyediting.

This research is a project of Rethink Priorities. It was written by Ben Cottier. Thanks to Alexis Carlier, Amanda El-Dakhakhni, Ashwin Acharya, Ben Snodin, Bill Anderson-Samways, Erich Grunewald, Jack Clark, Jaime Sevilla, Jenny Xiao, Lennart Heim, Lewis Ho, Lucy Lim, Luke Muehlhauser, Markus Anderljung, Max Räuker, Micah Musser, Michael Aird, Miles Brundage, Oliver Guest, Onni Arne, Patrick Levermore, Peter Wildeford, Remco Zwetsloot, Renan Araújo, Shaun Ee, Tamay Besiroglu, and Toby Shevlane for helpful feedback. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.
