I am not an expert in ML, but based on some conversations I was following, I heard that WuDao's LAMBADA score (an important performance measure for language models) is significantly lower than GPT-3's. I guess the number of parameters isn't everything.
I don't really know a lot about performance metrics for language models. Is there a good reason for believing that LAMBADA scores should be comparable for different languages?
Word on the grapevine: it sounds like they might just be adding a bunch of parameters in a way that's cheap to train but doesn't actually work that well (i.e. the "mixture of experts" thing).
It would be highly entertaining if ML researchers got into an arms race on parameter count, then Goodharted on it. Sounds like exactly the sort of thing I'd expect not-very-smart funding agencies to throw lots of money at. Perhaps the Goodharting would be done by the funding agencies themselves, by just funding whichever projects say they will use the most parameters, until they end up with lots of tiny nails. (Though one does worry that the agencies will find out that we can already do infinite-parameter-count models!)
That said, I haven't looked into it enough myself to be confident that that's what's happening here. I'm just raising the hypothesis from entropy.
I think this take is basically correct. Restating my version of it:
Mixture of Experts and similar approaches modulate paths through the network, such that not every parameter is used every time. This means that parameters and FLOPs (floating point operations) are more decoupled than they are in dense networks.
To me, FLOPs remain the harder-to-fake metric, but both are valuable to track going forward.
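To make the parameter/FLOP decoupling concrete, here is a rough back-of-the-envelope sketch for a single feed-forward block. Everything in it is an assumption for illustration: the layer sizes, the expert count, top-1 routing, and the helper names `dense_ffn` and `moe_ffn` are made up and are not taken from WuDao, GPT-3, or any published config.

```python
from typing import Tuple

# Illustrative sizes only; not the configuration of any real model.

def dense_ffn(d_model: int, d_ff: int) -> Tuple[int, int]:
    """Return (parameter count, approx. forward FLOPs per token), ignoring biases."""
    params = 2 * d_model * d_ff   # up-projection plus down-projection weights
    flops = 2 * params            # one multiply and one add per weight
    return params, flops

def moe_ffn(d_model: int, d_ff: int, n_experts: int, top_k: int) -> Tuple[int, int]:
    """Same block with n_experts copies, of which only top_k run for each token."""
    expert_params, expert_flops = dense_ffn(d_model, d_ff)
    router_params = d_model * n_experts          # linear layer that scores experts
    params = n_experts * expert_params + router_params
    flops = top_k * expert_flops + 2 * router_params
    return params, flops

dense_params, dense_flops = dense_ffn(d_model=1024, d_ff=4096)
moe_params, moe_flops = moe_ffn(d_model=1024, d_ff=4096, n_experts=64, top_k=1)

print(f"params ratio (MoE / dense): {moe_params / dense_params:.1f}x")
print(f"FLOPs ratio  (MoE / dense): {moe_flops / dense_flops:.2f}x")
```

With these made-up numbers the MoE block has roughly 64x the parameters of the dense one while costing only about 1% more FLOPs per token, which is why a headline parameter count alone says little about compute.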
In a funny way, even if someone is stuck in a Goodhart trap with language models, it is probably better to Goodhart on Winograd Schema performance than on parameter count.
I think the Engadget article failed to capture the relevant info, so I'm just putting my preliminary thoughts down here. I expect my thoughts to change as more info is revealed/translated.
Loss on the dataset (for cross-entropy, this is measured in bits per token or per character; perplexity is the exponentiated form of the same number) is a more important metric than parameter count, in my opinion.
However, I think parameter count does matter at least a little, because it is a signal of:
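For anyone unfamiliar with these units, here is a minimal sketch of how the numbers relate. The probabilities are invented for illustration, and the function names are hypothetical, not from any library. It assumes you already have the probability the model assigned to each token that actually occurred.

```python
import math

def cross_entropy_bits(token_probs):
    """Average negative log2-probability of the observed tokens (bits per token)."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is 2 raised to the cross-entropy measured in bits."""
    return 2 ** cross_entropy_bits(token_probs)

def bits_per_char(token_probs, n_chars):
    """Same total bits, normalized by character count instead of token count."""
    total_bits = -sum(math.log2(p) for p in token_probs)
    return total_bits / n_chars

# Made-up probabilities for a 4-token sequence that spans 18 characters.
probs = [0.5, 0.25, 0.125, 0.125]

print("bits per token:", cross_entropy_bits(probs))
print("perplexity:", perplexity(probs))
print("bits per character:", bits_per_char(probs, n_chars=18))
```

Per-token numbers depend on how the tokenizer splits the text, so the per-character form is usually the safer one when comparing models that use different vocabularies.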
* the amount of resources that are available to the researchers (very expensive to do very large runs)
* the amount of engineering capacity that the project has access to (difficult to write code that functions well at that scale -- nontrivial to just code a working 1.7T parameter model training loop)
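As a rough illustration of why that scale is an engineering problem, here is some arithmetic on weight storage alone. The 2 bytes per parameter (16-bit floats) and the 40 GB per-accelerator memory are my assumptions for the sketch, not details reported for this model.

```python
# Assumptions for illustration: fp16 weights and a hypothetical 40 GB accelerator.
n_params = 1_700_000_000_000      # the 1.7T figure mentioned above
bytes_per_param = 2               # assumed 16-bit weights
device_memory_bytes = 40 * 10**9  # assumed per-device memory

weight_bytes = n_params * bytes_per_param
# Integer ceiling division: devices needed just to hold one copy of the weights.
min_devices = (weight_bytes + device_memory_bytes - 1) // device_memory_bytes

print(f"weights alone: {weight_bytes / 10**12:.1f} TB")
print(f"devices needed just to store them: {min_devices}")
```

Under those assumptions the weights come to 3.4 TB, or 85 devices before counting gradients, optimizer state, or activations, so the model has to be split across many machines whatever the architecture.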
I expect more performance metrics at some point, on the normal set of performance benchmarks.
I also expect to be very interested in how they release/share/license the model (if at all), and who is allowed access to it.
If I understood correctly, the model was trained in Chinese and was probably quite expensive to train.
Do you know whether these Chinese models usually get "translated" to English, or whether there is a "fair" way of comparing models that were (mainly) trained on different languages (I'd imagine that even the tokenization might be quite different for Chinese)?
In my experience, I haven't seen a good "translation" process -- instead models are pretrained on bigger and bigger corpora which include more languages.
GPT-3 was trained on data that was mostly English, but (AFAICT) it is able to generate other languages as well.
For some English-dependent metrics (SuperGLUE, Winogrande, LAMBADA, etc.), I expect a model trained primarily on non-English corpora would do worse.
Also, yes, I would expect the tokenization to be different for a largely different corpus.
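One crude way to see why token-level comparisons across languages get murky is to look at raw UTF-8 bytes, used here as a stand-in for a real tokenizer (which I am not modeling). The example strings and the helper name `utf8_stats` are my own.

```python
def utf8_stats(text: str):
    """Return (characters, UTF-8 bytes, bytes per character) for a string."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return n_chars, n_bytes, n_bytes / n_chars

english = "hello world"
chinese = "你好世界"  # roughly "hello world"

for label, text in [("English", english), ("Chinese", chinese)]:
    n_chars, n_bytes, ratio = utf8_stats(text)
    print(f"{label}: {n_chars} chars, {n_bytes} bytes, {ratio:.1f} bytes/char")
```

The English string is one byte per character while the Chinese one is three bytes per character and uses far fewer characters for similar content, so "per token" or "per character" losses are not measuring the same thing across the two.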
I don't think I could even imagine what kinds of Deep Fakes could be made using this system. Maybe used for propaganda first to develop the tech further? I'm usually just a little suspicious of new tech coming from anywhere though, not just China.
How big of a deal is that? Seems huge. Bigger than Switch Transformers and 10x bigger than GPT-3.