Note I'm mainly using this as an opportunity to talk about ideas and compute in NLP.
I don't know how big an improvement DeBERTaV2 is over SoTA.
DeBERTaV2 is pretty solid and mainly got its performance from an architectural change. Note the DeBERTa paper was initially uploaded in 2020, but it was updated early this year to include DeBERTa V2. The previous widely used SOTA model on SuperGLUE was T5 (which beat RoBERTa). DeBERTaV2 uses 8x fewer parameters and 4x less compute than T5. DeBERTa's high performance isn't an artifact of SuperGLUE; it also does better on downstream tasks, such as some legal NLP tasks.
Bidirectional models do far better than unidirectional models on NLU tasks. On CommonsenseQA, a good task that's been around for a few years, they do far better than fine-tuned GPT-3. DeBERTaV3 differs from GPT-3 in three ideas (roughly encoding, ELECTRA training, and bidirectionality, if I recall correctly), and it's >400x smaller.
I agree with the overall sentiment that much of the performance is from brute compute, but even in NLP, ideas can help sometimes. For vision/continuous signals, algorithmic advances continue to account for much progress; ideas move the needle substantially more frequently in vision than in NLP.
For tasks where there is less traction, ideas are even more useful. Just to use a recent example, "the use of verifiers results in approximately the same performance boost as a 30x model size increase." I think the initially proposed heuristic depends on how much progress has already been made on a task. For nearly solved tasks, the next incremental idea shouldn't help much. On new hard tasks such as some maths tasks, scaling laws are worse and ideas will be a practical necessity. Not all of the first ideas are obvious "low-hanging fruit," because it might take a while for the community to get oriented and find good angles of attack.
RE: "like I'm surprised if a clever innovation does more good than spending 4x more compute"
Earlier this year, DeBERTaV2 did better on SuperGLUE than models 10x the size and got state of the art.
Models such as DeBERTaV3 can do better on commonsense question answering tasks than models that are tens or several hundreds of times larger.
Accuracy: 84.6, Parameters: 0.4B
Accuracy: 83.5, Parameters: 11B
Accuracy: 73.0, Parameters: 175B
Bidirectional models + training ideas + better positional encoding helped more than a 4x increase in compute would have.
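To make the size comparison concrete, here is a minimal sketch of the arithmetic behind the accuracy/parameter figures quoted above. The rows are left unlabeled because those lines don't name the models, and the `rows` structure and `compare` helper are hypothetical names used only for illustration.

```python
# Figures copied from the accuracy/parameter lines above.
# Parameter counts are in billions; the rows are left unlabeled here
# because the lines above do not name the models.
rows = [
    {"accuracy": 84.6, "params_b": 0.4},
    {"accuracy": 83.5, "params_b": 11.0},
    {"accuracy": 73.0, "params_b": 175.0},
]

smallest = rows[0]


def compare(row, baseline):
    """Return (size ratio, accuracy gap in points) of row vs. baseline."""
    size_ratio = row["params_b"] / baseline["params_b"]
    accuracy_gap = baseline["accuracy"] - row["accuracy"]
    return size_ratio, accuracy_gap


# Compare each larger model against the smallest one.
comparisons = [compare(row, smallest) for row in rows[1:]]

for row, (ratio, gap) in zip(rows[1:], comparisons):
    print(f"{row['params_b']:g}B model: {ratio:.1f}x larger, "
          f"{gap:.1f} points behind the 0.4B model")
```

The larger of the two ratios comes out to roughly 437x, which is consistent with the ">400x smaller" figure mentioned earlier, and in both cases the smaller model has the higher accuracy.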
In safety research labs in academe, we do not have a resource edge compared to the rest of the field.
We do not have large GPU clusters, so we cannot train GPT-2 from scratch or fine-tune large language models in a reasonable amount of time.
We also do not have many research engineers (currently zero) to help us execute projects. Some of us have safety projects from over a year ago on the backlog because there are not enough reliable people to help execute the projects.
These are substantial bottlenecks that more resources could resolve.