A Breakdown of AI Chip Companies

"I have heard that they get the details wrong though, and the fact that they [Groq] are still adversing their ResNet-50 performance (a 2015 era network) speaks to that."

I'm not sure I fully get this criticism: ResNet-50 is the most standard image recognition benchmark and unsurprisingly it's the only (?) architecture that NVIDIA lists in their benchmarking stats for image recognition as well:

Experiments with a random clock

This is a very neat idea. Is there an easy way to enable this for Android and Google Calendar notifications? I guess not.

What are the long-term outcomes of a catastrophic pandemic?

Kudos for writing this list! It helped me think about the pandemic better in March 2020 and milder versions of at least some points now seem a part of reality.

Alcohol, health, and the ruthless logic of the Asian flush

Yep, the first Google result http://xn--80akpciegnlg.xn--p1ai/preparaty-dlya-kodirovaniya/disulfiram-implant/ (in Russian) says that an implant with 1-2 g of the substance is used for 5-24 months and that "the minimum blood level of disulfiram is 20 ng/ml". This paper says "Mild effects may occur at blood alcohol concentrations of 5 to 10 mg/100 mL."

Scott Alexander 2021 Predictions: Market Prices

Ethereum above 0.05 BTC: 70%

This already happened today (a day after this post).

I would have put this way higher due to the value proposition of Ethereum + the massive Ethereum ecosystem + the fact that it hasn't rallied that much yet against BTC compared to its 2017 values + bright future plans for Ethereum + competitors being forced to integrate with Ethereum while lacking some of its properties. IDK if these are objectively good reasons for expecting growth, but they are there in my personal model.

Discussion of concrete near-to-middle term trends in AI

The prediction about CV doesn't seem to have aged that well in my view. Others are going fairly well!

"AI and Compute" trend isn't predictive of what is happening

gwern has recently remarked that one cause of this is supply and demand disruptions and this may be a temporary phenomenon in principle.

"AI and Compute" trend isn't predictive of what is happening

I appreciate questioning of my calculations, thanks for checking!

This is what I think about the previous avturchin calculation: I think it may have been a misinterpretation of the DeepMind blogpost. In the blogpost they say "The AlphaStar league was run for 14 days, using 16 TPUs for each agent". But I think it might not be 16 TPUs for 14 days for each agent; it's 16 TPUs for 14/n_agent = 14/600 days for each agent, and the 14 days were for the whole League training, where agent policies were trained consecutively. Their wording is indeed not very clear, but you can look at the "Progression of Nash of AlphaStar League" pic. You can see there that, as they say, "New competitors were dynamically added to the league, by branching from existing competitors", and that the new ones drastically outperform the older ones, meaning that the older ones were not continuously updated and were only randomly picked as static opponents.
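To make the difference between the two readings concrete, here is a rough sketch in Python. It only uses the numbers quoted above (14 days, 16 TPUs, 600 policies); the variable names are mine, and which reading the earlier calculation actually used is an assumption.

```python
# Two readings of "run for 14 days, using 16 TPUs for each agent".
LEAGUE_DAYS = 14
TPUS_PER_AGENT = 16
N_POLICIES = 600  # policy count quoted in the comment

# Reading 1 (assumed to be the earlier interpretation): every policy
# trains on its own 16 TPUs for the full 14 days.
parallel_tpu_days = TPUS_PER_AGENT * LEAGUE_DAYS * N_POLICIES

# Reading 2 (the one argued for above): policies are trained one after
# another, so each gets 14 / 600 days on the 16 TPUs.
days_per_policy = LEAGUE_DAYS / N_POLICIES
sequential_tpu_days = TPUS_PER_AGENT * days_per_policy * N_POLICIES

print(parallel_tpu_days, sequential_tpu_days)
print("ratio:", parallel_tpu_days / sequential_tpu_days)
```

Under these assumptions the two readings differ by a factor equal to the number of policies, which is why the interpretation matters so much for the final estimate.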

From the blogpost: "A full technical description of this work is being prepared for publication in a peer-reviewed journal". The only publication about this is their late-2019 Nature paper linked by teradimich here, which I have taken the values from. They had upgraded their algorithm and spent more compute in a single experiment by October 2019. "12 agents" refers to the number of types of agents, and 600 (900 in the newer version) refers to the number of policies. About the 33% GPU utilization value: I think I've seen it in some ML publications and in other places for this hardware, and it seems like a reasonable estimate for all these projects, but I don't have sources at hand.

"AI and Compute" trend isn't predictive of what is happening

My calculation for AlphaStar: 12 agents * 44 days * 24 hours/day * 3600 sec/hour * 420*10^12 FLOP/s * 32 TPUv3 boards * 33% actual board utilization = 2.02 * 10^23 FLOP which is about the same as AlphaGo Zero compute.
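The same AlphaStar arithmetic as a small Python sketch, in case anyone wants to plug in other values. The function and constant names are made up for illustration; the 420 TFLOP/s per TPUv3 board and the 33% utilization are the assumptions stated above.

```python
SECONDS_PER_DAY = 24 * 3600
TPU_V3_BOARD_FLOPS = 420e12   # peak FLOP/s per TPUv3 board (assumed above)
UTILIZATION = 0.33            # assumed actual board utilization


def board_training_flop(agents, days, boards_per_agent):
    """Total FLOP for `agents` runs, each using `boards_per_agent` boards for `days`."""
    seconds = days * SECONDS_PER_DAY
    return agents * seconds * boards_per_agent * TPU_V3_BOARD_FLOPS * UTILIZATION


alphastar_flop = board_training_flop(agents=12, days=44, boards_per_agent=32)
print(f"AlphaStar: {alphastar_flop:.2e} FLOP")  # prints "AlphaStar: 2.02e+23 FLOP"
```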

For 600B GShard MoE model: 22 TPU core-years = 22 years * 365 days/year * 24 hours/day * 3600 sec/hour * 420*10^12 FLOP/s/TPUv3 board * 0.25 TPU boards / TPU core * 0.33 actual board utilization = 2.4 * 10^22 FLOP.

For 2.3B GShard dense transformer: 235.5 TPU core-years = 2.6 * 10^23 FLOP.
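A quick Python check of the core-year conversion, using exactly the factors listed above (420 TFLOP/s per board, 0.25 boards per core, 33% utilization). The helper name is hypothetical. Evaluating the factors as written gives about 2.4 * 10^22 FLOP for 22 core-years and about 2.6 * 10^23 FLOP for 235.5 core-years.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
TPU_V3_BOARD_FLOPS = 420e12   # peak FLOP/s per TPUv3 board (assumed above)
BOARDS_PER_CORE = 0.25        # conversion factor used in the comment
UTILIZATION = 0.33            # assumed actual utilization


def core_years_to_flop(core_years):
    """Convert TPU core-years into total FLOP under the assumptions above."""
    flops_per_core = TPU_V3_BOARD_FLOPS * BOARDS_PER_CORE * UTILIZATION
    return core_years * SECONDS_PER_YEAR * flops_per_core


gshard_moe_600b = core_years_to_flop(22)
gshard_dense_2_3b = core_years_to_flop(235.5)
print(f"GShard 600B MoE:   {gshard_moe_600b:.1e} FLOP")
print(f"GShard 2.3B dense: {gshard_dense_2_3b:.1e} FLOP")
```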

Meena was trained for 30 days on a TPUv3 pod with 2048 cores. So it's 30 days * 24 hours/day * 3600 sec/hour * 2048 TPUv3 cores * 0.25 TPU boards / TPU core * 420*10^12 FLOP/s/TPUv3 board * 33% actual board utilization = 1.8 * 10^23 FLOP, slightly below AlphaGo Zero.

Image GPT: "iGPT-L was trained for roughly 2500 V100-days" - this means 2500 days * 24 hours/day * 3600 sec/hour * 100*10^12 FLOP/s * 33% actual board utilization = 7.1 * 10^9 * 10^12 = 7.1 * 10^21 FLOP. There's no compute data for the largest model, iGPT-XL. But based on the FLOP/s increase from GPT-3 XL (same num of params as iGPT-L) to GPT-3 6.7B (same num of params as iGPT-XL), I think it required 5 times more compute: 3.6 * 10^22 FLOP.
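The iGPT numbers can be reproduced with a few lines of Python. This is only a sketch: the 100 TFLOP/s V100 peak and 33% utilization are the assumptions from the comment, and the 5x factor for iGPT-XL is my guess from the GPT-3 comparison, not a reported value. With these inputs the product comes out at about 7.1 * 10^21 FLOP for iGPT-L.

```python
SECONDS_PER_DAY = 24 * 3600
V100_FLOPS = 100e12        # assumed peak FLOP/s per V100
UTILIZATION = 0.33         # assumed actual utilization
IGPT_L_V100_DAYS = 2500    # "roughly 2500 V100-days" from the iGPT writeup
XL_SCALE_FACTOR = 5        # guessed compute ratio iGPT-XL / iGPT-L

igpt_l_flop = IGPT_L_V100_DAYS * SECONDS_PER_DAY * V100_FLOPS * UTILIZATION
igpt_xl_flop = XL_SCALE_FACTOR * igpt_l_flop

print(f"iGPT-L:  {igpt_l_flop:.1e} FLOP")
print(f"iGPT-XL: {igpt_xl_flop:.1e} FLOP (extrapolated)")
```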

BigGAN: 2 days * 24 hours/day * 3600 sec/hour * 512 TPU cores * 0.25 TPU boards / TPU core * 420*10^12 FLOP/s/TPUv3 board * 33% actual board utilization = 3 * 10^21 FLOP.

AlphaFold: they say they trained on GPU and not TPU. Assuming V100 GPU, it's 5 days * 24 hours/day * 3600 sec/hour * 8 V100 GPU * 100*10^12 FLOP/s * 33% actual GPU utilization = 10^20 FLOP.
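The Meena, BigGAN and AlphaFold estimates all follow the same pattern (devices * time * peak FLOP/s * utilization), so here is one generic helper in Python. The names are mine; the per-device peaks (a TPUv3 core taken as a quarter of a 420 TFLOP/s board, a V100 as 100 TFLOP/s) and the 33% utilization are the assumptions used above.

```python
SECONDS_PER_DAY = 24 * 3600
TPU_V3_CORE_FLOPS = 420e12 * 0.25   # quarter of a board per core, as assumed above
V100_FLOPS = 100e12                 # assumed peak FLOP/s per V100
UTILIZATION = 0.33                  # assumed actual utilization


def training_flop(devices, days, peak_flops_per_device, utilization=UTILIZATION):
    """Estimate total training FLOP from device count, duration and peak throughput."""
    return devices * days * SECONDS_PER_DAY * peak_flops_per_device * utilization


estimates = {
    "Meena": training_flop(2048, 30, TPU_V3_CORE_FLOPS),
    "BigGAN": training_flop(512, 2, TPU_V3_CORE_FLOPS),
    "AlphaFold": training_flop(8, 5, V100_FLOPS),
}

for name, flop in estimates.items():
    print(f"{name}: {flop:.1e} FLOP")
```

Changing `utilization` in the call is an easy way to see how sensitive each estimate is to the 33% guess.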

On language modeling and future abstract reasoning research

Here's a list of papers related to reasoning and RL for language models that were published in fall 2020 and that have caught my eye - you may also find it useful if you're interested in the topic.

Learning to summarize from human feedback - finetune GPT-3 to generate pieces of text to accomplish a complex goal, where performance ratings are provided by humans.
Keep CALM and Explore: Language Models for Action Generation in Text-based Games - an instance of the selector approach where a selector chooses between generated text candidates, similarly to "GeDi: Generative Discriminator Guided Sequence Generation". But here they actually use it for RL training.

Graph-based Multi-hop Reasoning for Long Text Generation - a two-stage approach to language modeling: in the 1st stage you process a knowledge graph corresponding to the context to obtain paths between the concepts, and in the 2nd stage you generate text incorporating these paths. You don't need to have graph data; the graphs can be built from text automatically. Seems to generate texts that are more diverse, informative and coherent compared to plain transformers. Seems like a quick fix for the problem of language models where they don't easily have coherent generation intents from letter to letter.

New losses
Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval - a new transformer-based retrieval model (retrieves an answer to a question by predicting the location of an answer in a pool of documents). This one is a multi-hop model which means that it searches for answers iteratively using information gathered during previous searches. Retrieval models have been successfully combined with text generation in the past to boost question answering performance.
MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale - take data from 140 StackExchange forums, train a model to match questions to answers. Performs well at answer selection in other domains unrelated to StackExchange.

Beyond Language: Learning Commonsense from Images for Reasoning - previously lots of methods used images + text in transformers to do e.g. visual reasoning. This one differs in that it shows that even if images are not present at test time, commonsense reasoning is still improved.
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision - solve exactly the same problem as "Beyond Language: Learning Commonsense from Images for Reasoning" which is to use images at training-time to benefit text-only generation.

Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning - a language model training method that makes it possible to generate missing text based on the past and the future. In addition to giving more control over the generation, it also improves abductive reasoning (hypothesis generation).
GeDi: Generative Discriminator Guided Sequence Generation - similar to the selector RL training idea I described for controlled generation, but without RL training. After generating a bunch of continuations with a generator, apply a selector trained in a different way to choose between them. This is a more sophisticated way to control generation compared to programming via prompts.
Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries - introduces a sampling strategy for text generative models where it first generates a high-level plan of the text with summaries of passages, and then generates the passages. Improves the training efficiency by a lot, and improves the likelihood of generated text as well.

Also, here's a list of some earlier papers that I found interesting:

Analyzing mathematical reasoning abilities of neural models (transformers for symbolic math reasoning, Apr'19)
Transformers as Soft Reasoners over Language (evaluating logical reasoning in natural language domain, May'19)
Teaching Temporal Logics to Neural Networks (transformers for logical inference, Jun'19)
REALM: Retrieval-Augmented Language Model Pre-Training (Google, Feb'20)
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (FAIR, Jul'20)
