Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'm talking about these agents (LW thread here)

I'd love an answer either in operations (MIPS, FLOPS, whatever) or in dollars.

Follow-up question: How many parameters did their agents have?

I just read the paper (incl. appendix) but didn't see them list the answer anywhere. I suspect I could figure it out from information in the paper, e.g. by adding up how many neurons are in their LSTMs, their various other bits, etc. and then multiplying by how long they said they trained for, but I lack the ML knowledge to do this correctly.

Some tidbits from the paper:

For multi-agent analysis we took the final generation of the agent(generation5)andcreatedequallyspacedcheckpoints (copies of the neural network parameters) every 10 billion steps, creating a collection of 13 checkpoints.

This suggests 120 billion steps of training for the final agents. But elsewhere in the post they state each agent in the final generation experienced 200 billion training steps, so.... huh?

Anyhow. Another tidbit:

In addition to the agent exhibiting zero-shot capabilities across a wide eval-uation space, we show that finetuning on a new task for just 100 million steps (around 30 minutes of compute in our setup) can lead to drastic increases in performance relative to zero-shot, and relative to training from scratch which often fails completely.

So, if 100 million steps takes 30min in their setup, and they did 200 billion steps for the final generation, that means the final generation took 30 x 2,000 = 41 days. Makes sense. So the whole project probably took something like 100 - 200 days, depending on whether generations 1 - 4 were quicker.

How much does that cost though??? In dollars or FLOPs? I have no idea.

EDIT: It says each agent was trained on 8 TPUv3's. But how many agents were there? I can't find anything about the population size. Maybe I'm not looking hard enough.

New Answer
New Comment

1 Answers sorted by

I have a guesstimate for number of parameters, but not for overall compute or dollar cost:

Each agent was trained on 8 TPUv3's, which cost about $5,000/mo according to a quick google, and which seem to produce 90 TOPS, or about 10^14 operations per second. They say each agent does about 50,000 steps per second, so that means about 2 billion operations per step. Each little game they play lasts 900 steps if I recall correctly, which is about 2 minutes of subjective time they say (I imagine they extrapolated from what happens if you run the game at a speed such that the physics simulation looks normal-speed to us). So that means about 7.5 steps per subjective second, so each agent requires about 15 billion operations per subjective second.

So... 2 billion operations per step suggests that these things are about the size of GPT-2, i.e. about the size of a rat brain? If we care about subjective time, then it seems the human brain maybe uses 10^15 FLOP per subjective second, which is about 5 OOMs more than these agents.

Your link says rats have ~200 million neurons, but I think synapses are a better comparison for NN parameters. After all, both synapses and parameters roughly store how strong the connections between different neurons are.

Using synapse count, these agents are closer to guppies than to rats.

The TOPS numbers from the wiki page seem wrong. TPUv1 had 92 TOPS (uint8); for TPUv3 the "90 TOPS" refers to a single chip, but I'm fairly sure that when the paper says "8 TPUv3s" they mean 8 cards, as that's how they are available on Google Cloud (1 card = 4 chips).

2Daniel Kokotajlo3y
Huh, thanks! I guess my guesstimate is wrong then. So should I multiply everything by 8?

Do you mind sharing your guesstimate on number of parameters?

Also, do you have per chance guesstimates on number of parameters / compute of other systems?

2Daniel Kokotajlo3y
I did, sorry -- I guesstimated FLOP/step and then figured parameters is probably a bit less than 1 OOM less than that. But since this is recurrent maybe it's even less? IDK. My guesstimate is shitty and I'd love to see someone do a better one!

Also for comparison, I think this means these models were about twice as big as AlphaStar. That's interesting.

Michael Dennis tells me that population-based training typically sees strong diminishing returns to population size, such that he doubts that there were more than one or two dozen agents in each population/generation. This is consistent with AlphaStar I believe, where the number of agents was something like that IIRC...

Anyhow, suppose 30 agents per generation. Then that's a cost of $5,000/mo x 1.3 months x 30 agents = $195,000 to train the fifth generation of agents. The previous two generations were probably quicker and cheaper. In total the price is prob... (read more)

Makes sense given the spinning-top topology of games. These tasks are probably not complex enough to need a lot of distinct agents/populations to traverse the wide part to reach the top where you then need little diversity to converge on value-equivalent models. One observation: you can't run SC2 environments on a TPU, and when you can pack the environment and agents together onto a TPU and batch everything with no copying, you use the hardware closer to its full potential, see the Podracer numbers.
3Julian Schrittwieser3y
Only Anakin actually runs the environment on the TPU, and this only works for pretty simple environments (basically: can you implement it in JAX?) Sebulba runs environments on the host, which is what would have been done for this paper too (no idea if they used Sebulba or had a different setup). This doesn't really matter though, because for these simulated environments it's fairly simple to fully utilize the TPUs by running more (remote) environments in parallel. 
Yes, I see that they used Unity, so the TPUs themselves couldn't run the env, but the TPU CPU VM* could run potentially a lot of copies (with that like 300GB of RAM it's got access to), and that'd be a lot nicer than running remote VMs. At least in Tensorfork, when we try to use TPU pods, a lot of time goes into figuring out correct use of the interconnect & traffic because the on-TPU ops are so optimized by default. (And regardless of which of those tricks this open-ended paper uses, this is a point well worth knowing about how research could potentially gets way more performance out of a TPU pod than one would expect from knowing TPU usage of old stuff like AlphaStar.) * advertisement: access to the VM was recently unlocked for non-Google TPU users. It really changes how you treat TPU use!