How does GPT-3 spend its 175B parameters?
nowherefly Zhang · 3y · 10

I love your work. 

Besides, to your point on the FFN:
I don't think the FFN is more important than attention.
Of course the parameter count is important, but it is not the only important thing.
You also need to consider the compute cost of each module. If my understanding is right, the FFN costs O(1) per token, the encoder's self-attention costs O(n_ctx^2), and the decoder's masked self-attention is... well, the decoder is a bit more complicated: it has to run many times to produce the full output sequence, so the total cost depends on both the input length and the output length. To a first approximation, each generated token attends over all the tokens to its left, which costs O(n_lefttokens), roughly O(n). That means not all parameters are equally important.
So if you do a compute-cost-weighted sum, the attention parameters carry far more weight than the FFN parameters.
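
To make that scaling concrete, here is a rough back-of-the-envelope FLOP count per token and per layer (my own sketch, not from the post): d_model = 12288 and n_ctx = 2048 are GPT-3's values, d_ff is taken as 4 * d_model, and a multiply-add is counted as two FLOPs. The point is that the FFN term is independent of context length, while the attention-over-context term is the one that grows with n_ctx.

```python
# Rough per-token, per-layer FLOP estimate for a decoder-only transformer.
# Back-of-the-envelope sketch only; a multiply-add counts as 2 FLOPs.

def per_token_flops(d_model: int, n_ctx: int) -> dict:
    d_ff = 4 * d_model
    # FFN: two matmuls (d_model -> d_ff -> d_model); per-token cost is
    # independent of context length.
    ffn = 2 * (d_model * d_ff + d_ff * d_model)
    # Attention Q/K/V/output projections: also independent of context length.
    attn_proj = 2 * 4 * d_model * d_model
    # Attention scores plus the weighted sum over the n_ctx cached keys/values:
    # this is the part that grows linearly with context length for each token,
    # and therefore quadratically over a whole generated sequence.
    attn_ctx = 2 * 2 * n_ctx * d_model
    return {"ffn": ffn, "attn_proj": attn_proj, "attn_over_context": attn_ctx}

# GPT-3 175B shape (per layer; multiply by n_layer = 96 for the full model).
print(per_token_flops(d_model=12288, n_ctx=2048))
```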

This could also explain why there is a limit on n_ctx. 

This could also give you a hint that although the encoder and decoder are structurally similar, they are doing different things. And GPT IS a decoder-only architecture.
