GPT-175bee

Adam Scherlis; LawrenceC

Epistemic status: whimsical

Bees: a new unit of measurement for ML model size

Talking about modern ML models inevitably leads to a bunch of hard-to-intuit large numbers, especially when it comes to parameter count.

To address this, we propose that we adopt a new, human-friendly unit to measure the number of learnable parameters in an architecture:

1 beepower = 1 BP = 1 billion parameters

Bees have about one billion^[1] synapses^[2] in their forebrain^[3], so this gives a nice basis for comparisons^[4] between animal brains and artificial neural nets.

Like horsepower and candlepower,^[5] the unit of beepower expresses the scale of a new and unfamiliar technology in terms that we are already familiar with. And it makes discussion of model specs flow better.

"This model has twenty bees", you might say. Or "wow, look at all that beepower; did you train long enough to make good use of it?"

Here's a helpful infographic to calibrate you on this new unit of measurement:

The parameter count of various recent language models, denoted in beepower.

Other animals

We can even benchmark^[6] against more or less brainy animals.

The smallest OpenAI API model, Ada, is probably^[7] 350 million parameters, or about a third of a bee, which is comparable to a cricket:

[Image: a cricket] While Jiminy Cricket can compose better English than Ada, this cricket cannot.

The next size up, Babbage, is around 1.3 BP, or cockroach-sized.

Curie has almost seven bees, which is... sort of in an awkward gap between insects and mammals.

Davinci is a 175-bee model, which gets us up to hedgehog (or quail) scale:

[Image: a hedgehog] As a large language model trained by OpenAI, I don't have the ability to "be the cutest little guy oh my gosh"

Gopher (280 BP) is partridge (or ferret) sized. More research into actual gophers is needed to know how many gophers worth of parameters Gopher has.

[Image: a grey partridge] Really, they should've named Gopher "Partridge" or "Ferret"!

Amusingly, PaLM, at 540 bees, has about as many parameters as a chinchilla has synapses:^[8]

We think PaLM has about one chinchilla worth of parameters. This isn't confusing at all.

Tragically, we could not figure out how many palms worth of parameters Chinchilla (70 bees) has. We leave this as an exercise for the reader.
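For anyone who wants to do the conversions at home, here is a minimal Python sketch of the unit. The parameter counts are the ones quoted in this section; Curie's 6.7 billion is an assumption standing in for "almost seven bees", and the names `PARAMS_PER_BEE`, `to_beepower`, and `describe` are made up for illustration.

```python
# 1 BP = 1 billion parameters, per the definition in this post.
PARAMS_PER_BEE = 1_000_000_000

# Parameter counts as quoted in the post.
MODELS = {
    "Ada": 350e6,        # "probably" 350 million
    "Babbage": 1.3e9,
    "Curie": 6.7e9,      # assumption: the post only says "almost seven bees"
    "Chinchilla": 70e9,
    "Davinci": 175e9,
    "Gopher": 280e9,
    "PaLM": 540e9,
}


def to_beepower(n_params: float) -> float:
    """Convert a raw parameter count to beepower (BP)."""
    return n_params / PARAMS_PER_BEE


def describe(name: str) -> str:
    """Format one model's size in BP."""
    return f"{name}: {to_beepower(MODELS[name]):g} BP"


for model_name in MODELS:
    print(describe(model_name))
```

Running it prints one line per model, from Ada at about a third of a bee up to PaLM at 540.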

^[1] There are about 170,000 neurons in the corpora pedunculata of a honeybee, or roughly 140,000 after adjusting for the tendency of optical fractionators to overcount. Some sources give about 7,000 synapses per neuron for the human brain, and it turns out humans and mice have comparable synapses-per-neuron counts, so the figure doesn't scale that badly with brain size. Skeptical readers are encouraged to shut up and multiply.
^[2] Is one synapse equivalent to one parameter? Well, there are about five bits of recoverable information encoded in the strength of a synaptic connection, and neural-net parameters can be compressed to eight bits (or even four bits!) without too much loss of performance, so kinda-ish yeah.
^[3] Wikipedia claims the corpora pedunculata ("mushroom bodies") of insects, which "are known to play a role in olfactory learning and memory", are analogous to the mammalian cerebral cortex or the avian hyperpallium. Sure, why not.
^[4] The thing about apples and oranges is that nobody can actually stop you from comparing them.
^[5] A hundred-watt incandescent gets about a thousand candlepower per horsepower; LEDs can get 6,000 candles per horse or even more.
^[6] Based on "forebrain" neuron numbers from this Wikipedia article, assuming 7,000 synapses per neuron, with optical fractionator counts discounted by a factor of about 0.8 based on the pairwise comparisons available there.
^[7] https://blog.eleuther.ai/gpt3-model-sizes/
^[8] Due to a lack of interest in studying chinchillas, there doesn't seem to have been a direct measurement of the synapse or neuron count for chinchillas. That being said, rabbits have around 500 billion synapses, and (domesticated) chinchillas are around the same size and body weight as (smaller) rabbits and have the same cerebellum weight, so we feel justified in making this claim anyways. :)
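Readers who do want to shut up and multiply can follow along with this rough Python sketch of the estimate from the first and sixth footnotes. The neuron count, the 0.8 discount, and the 7,000 synapses per neuron all come from those footnotes; the function and constant names are hypothetical.

```python
SYNAPSES_PER_NEURON = 7_000   # "some sources", for the human brain
FRACTIONATOR_DISCOUNT = 0.8   # rough correction for optical fractionator overcounting


def forebrain_synapses(neurons: float, optical_fractionator: bool = True) -> float:
    """Estimate synapse count from a forebrain neuron count.

    If the neuron count came from an optical fractionator, discount it first.
    """
    adjusted = neurons * FRACTIONATOR_DISCOUNT if optical_fractionator else neurons
    return adjusted * SYNAPSES_PER_NEURON


# Honeybee: about 170,000 neurons in the corpora pedunculata.
bee_synapses = forebrain_synapses(170_000)
print(f"{bee_synapses / 1e9:.2f} billion synapses")
```

With these inputs the estimate lands a little under one billion, which is close enough for a unit named after an insect.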

Addendum: a human neocortex has on the order of 140 trillion synapses, or 140,000 bees. An average beehive has 20,000-80,000 bees in it.

[Holding a couple beehives aloft] Beehold a man!

(punchline courtesy of Alex Gray)
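For the skeptical, the addendum's hive arithmetic can be checked with a few lines of Python. The figures are the ones quoted above (140 trillion synapses, one billion synapses per bee, 20,000 to 80,000 bees per hive); the variable names are invented for this sketch.

```python
HUMAN_NEOCORTEX_SYNAPSES = 140e12   # "on the order of" 140 trillion
SYNAPSES_PER_BEE = 1e9              # one bee is about one billion synapses
BEES_PER_HIVE = (20_000, 80_000)    # range for an average beehive

# A human neocortex, measured in bees.
human_in_bees = HUMAN_NEOCORTEX_SYNAPSES / SYNAPSES_PER_BEE

# The same neocortex, measured in hives (big hives first, then small ones).
fewest_hives = human_in_bees / BEES_PER_HIVE[1]
most_hives = human_in_bees / BEES_PER_HIVE[0]

print(f"{human_in_bees:,.0f} bees, or {fewest_hives:g} to {most_hives:g} hives")
```

So "a couple" holds for large hives, and the estimate rises to seven for small ones.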

The thought that GPT-3 is a mere 175 bees of brain is extremely disturbing

There's an important timelines crux to do with whether artificial neural nets are more or less parameter-efficient than biological neural nets. There are a bunch of arguments pointing in either direction, such that our prior uncertainty should range over several orders of magnitude in either direction.

Well, seeing what current models are capable of has updated me towards the lower end of that range. Seems like transformers are an OOM or two more efficient than the human brain, on a parameter-to-synapse comparison, at least when you train them for ridiculously long like we currently do.

I'd be interested to hear counterarguments to this take.

If you haven’t already seen it, I wrote about that recently here. Note the warning at the top. I wrote a decent chunk of a follow-up post, but one section will be a lot of work for me and I’m planning to procrastinate it a while. I can share a draft if you’re interested. I’m still on the “100T parameters is super-excessive for human-level AGI” side of the debate, although I think I overstated the case in that post. My take on transformers is something vaguely like “The thing that GPT-3 is doing, it’s already able to do it at or beyond human-level. However, human brains are doing other things too.”

Parameter/synapse count is actually not really that important by itself; the first principal component in terms of predictive capability is net training compute. All successful NNs operate in the overcomplete regime, where they have far more circuit capacity than the minimal circuit required to achieve a comparable capability on their training set. This is implied by the various scaling law papers. It's also why young human children have an OOM more synapses than adults, why you can prune down a trained network by some OOMs related to its overcapacity factor, why there are so many DL papers about the "lottery ticket" hypothesis and related ideas, etc.

net_training_compute = synaptic_compute * training_time

It's about the total circuit space search volume explored, not the circuit size. You can achieve the same volume and thus capability by training a smaller more compressed circuit for much longer (as in ANNs), or a larger circuit for less time (as in BNNs).
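A toy Python sketch of that tradeoff, using the formula above. It assumes, purely for illustration, that synaptic compute is proportional to parameter or synapse count and that time is in arbitrary units; neither assumption comes from the comment itself, and the function names are hypothetical.

```python
def net_training_compute(synaptic_compute: float, training_time: float) -> float:
    """net_training_compute = synaptic_compute * training_time."""
    return synaptic_compute * training_time


def time_to_match(target_compute: float, synaptic_compute: float) -> float:
    """Training time a circuit needs to reach a given net training compute."""
    return target_compute / synaptic_compute


# Illustrative sizes only: a 140-trillion-synapse circuit trained for 1 time
# unit, versus a 175-billion-parameter circuit (assumption: compute per unit
# time scales with size).
large_circuit = 140e12
small_circuit = 175e9

target = net_training_compute(large_circuit, training_time=1.0)
print(time_to_match(target, small_circuit))  # how much longer the small circuit trains
```

Under these assumptions the smaller circuit has to train 800 times longer to cover the same search volume.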

Only if you're overcomplete enough to have a winning ticket at init time. With that caveat, agreed. If you don't have a winning ticket at init time, you need things like evolutionary search, which can be drastically less efficient depending on the details of the update rule.

Yeah I was tempted to make a human one, for the lols (a human is ~100k bees), ~~but decided even I have better things to do with my life than this~~

JK I'll probably do it the next time I get bored

And... it's done! Only crashed Figma like 3 times!

It's not a proper benchmark without a human baseline, so here you go:

Yes, that's the entire LM figure, to scale, on the bottom left corner.

It's like an obscure part of the old Imperial weights and measures tables.

3 crickets, 1 bee.
175 bees, one hedgehog or quail. (Rather a big jump there.)
3 quails or 2 gophers, one chinchilla.
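In the same spirit, here is a quick Python check of how far off each line of that table is. It uses the post's sizes in bees (cricket about 0.35, quail 175, Gopher 280, chinchilla 540); the dictionary and function names are invented for this sketch.

```python
# Unit sizes in bees, taken from the post's comparisons.
UNITS_IN_BEES = {
    "cricket": 0.35,
    "bee": 1.0,
    "quail": 175.0,
    "gopher": 280.0,       # the model Gopher, not the animal
    "chinchilla": 540.0,   # the animal, i.e. PaLM-sized
}


def relative_error(count: int, small: str, large: str) -> float:
    """Signed error of saying `count` small units make one large unit."""
    approx = count * UNITS_IN_BEES[small]
    exact = UNITS_IN_BEES[large]
    return (approx - exact) / exact


for count, small, large in [
    (3, "cricket", "bee"),
    (175, "bee", "quail"),
    (3, "quail", "chinchilla"),
    (2, "gopher", "chinchilla"),
]:
    print(f"{count} {small}s per {large}: {relative_error(count, small, large):+.1%}")
```

Every line comes out within about five percent, which is respectable for a weights-and-measures table.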

I'd heard of a 'hive mind', but this is ridiculous.

(tone: wordplay, not criticism!)

The combination of chinchillas and whimsy always* reminds me of Magical Trevor 3.

*: "always" as in I have read this post once and it has reminded me of MT3 once.
