GPT-175bee
Epistemic status: whimsical

Bees: a new unit of measurement for ML model size

Talking about modern ML models inevitably leads to a bunch of hard-to-intuit large numbers, especially when it comes to parameter count. To address this, we propose that we adopt a new, human-friendly unit to measure the number of learnable parameters in an architecture:

1 beepower = 1 BP = 1 billion parameters

Bees have about one billion[1] synapses[2] in their forebrain[3], so this gives a nice basis for comparisons[4] between animal brains and artificial neural nets.

Like horsepower and candlepower,[5] the unit of beepower expresses the scale of a new and unfamiliar technology in terms that we are already familiar with. And it makes discussion of model specs flow better. "This model has twenty bees", you might say. Or "wow, look at all that beepower; did you train long enough to make good use of it?"

Here's a helpful infographic to calibrate you on this new unit of measurement:

[Infographic: the parameter counts of various recent language models, denoted in beepower.]

Other animals

We can even benchmark[6] against more or less brainy animals.

The smallest OpenAI API model, Ada, is probably[7] 350 million parameters, or about a third of a bee, which is comparable to a cricket:

[Photo of a cricket. Caption: "While Jiminy Cricket can compose better English than Ada, this cricket cannot."]

The next size up, Babbage, is around 1.3 BP, or cockroach-sized. Curie has almost seven bees, which is... sort of in an awkward gap between insects and mammals.

Davinci is a 175-bee model, which gets us up to hedgehog (or quail) scale:

[Photo of a hedgehog. Caption: "As a large language model trained by OpenAI, I don't have the ability to 'be the cutest little guy oh my gosh'."]

Gopher (280 BP) is partridge (or ferret) sized. More research into actual gophers is needed to know how many gophers' worth of parameters Gopher has. Really, they should've named Gopher "Partridge" or "Ferret"!

Amusingly, PaLM, at 540 bees, has about as many parameters as a chinchilla has synapses.
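If you want to do the conversion yourself, here is a minimal sketch of the beepower arithmetic, using the rough parameter counts quoted above; the dictionary and the helper name to_bees are ours for illustration, not anyone's established API:

```python
BEE = 1_000_000_000  # parameters per bee (1 beepower = 1 BP)

# Rough parameter counts quoted in the post.
MODEL_PARAMS = {
    "Ada": 350_000_000,           # ~0.35 bees (cricket-ish)
    "Babbage": 1_300_000_000,     # ~1.3 bees (cockroach-sized)
    "Curie": 6_700_000_000,       # almost seven bees
    "Davinci": 175_000_000_000,   # 175 bees (hedgehog / quail scale)
    "Gopher": 280_000_000_000,    # 280 bees (partridge / ferret sized)
    "PaLM": 540_000_000_000,      # 540 bees (~a chinchilla's synapse count)
}

def to_bees(n_params: int) -> float:
    """Convert a raw parameter count to beepower (BP)."""
    return n_params / BEE

if __name__ == "__main__":
    for name, params in MODEL_PARAMS.items():
        print(f"{name}: {to_bees(params):.2f} BP")
```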
This post makes the excellent point that the paradigm that motivated SAEs -- the superposition hypothesis -- is incompatible with widely-known and easily demonstrated properties of SAE features (and feature vectors in general). The superposition hypothesis assumes that feature vectors have nonzero cosine similarity only because there isn't enough space for them all to be orthogonal, in which case the cosine similarities themselves shouldn't be meaningful. But in fact, cosine similarities between feature vectors have rich semantic content, as shown by circular embeddings (in several contexts) and feature splitting / dimensionality-reduction visualizations. Features aren't just crammed together arbitrarily; they're grouped with similar features.
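The structure being pointed at is easy to inspect directly. Here is a minimal sketch, assuming you have an SAE decoder matrix of feature directions (the array name, shapes, and random stand-in weights below are assumptions for illustration, not any particular SAE library's API):

```python
import numpy as np

# Assume `W_dec` is an (n_features, d_model) array with one learned
# feature direction per row. Random weights stand in for a real SAE here.
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(1000, 128))

# Unit-normalize each feature direction.
W = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)

# Pairwise cosine similarities between feature directions.
cos = W @ W.T
np.fill_diagonal(cos, -np.inf)  # ignore self-similarity

# Nearest neighbor of each feature in direction space. Under the naive
# superposition picture these overlaps would be semantically arbitrary;
# the post's point is that in practice (feature splitting, circular
# embeddings) nearest neighbors tend to be semantically related features.
nearest = cos.argmax(axis=1)
print(nearest[:10], cos[np.arange(10), nearest[:10]])
```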
I didn't properly appreciate this point before reading this post...