i don't have a graph for it. the corresponding number is p(correct) = 0.25 at 63 elements for the one dense model i ran this on. (the number is not in the paper yet because this last result came in approximately an hour ago)
the other relevant result in the paper for answering the question of how similar our sparse models are to dense models is figure 33
creating surprising adversarial attacks using our recent paper on circuit sparsity for interpretability
we train a model with sparse weights and isolate a tiny subset of the model (our "circuit") that does this bracket counting task where the model has to predict whether to output ] or ]]. It's simple enough that we can manually understand everything about it, every single weight and activation involved, and even ablate away everything else without destroying task performance.
(this diagram is for a slightly different task because i spent an embarrassingly large number of hours making this figure and decided i never wanted to make another one ever again)
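to make the task concrete, here's a tiny illustrative sketch of what the bracket-counting examples look like (the exact data format and tokenization in the paper may differ; these strings are made up for illustration):

```python
# hypothetical examples of the bracket-counting task -- the format is
# illustrative, not the exact tokenization used in the paper
examples = [
    # prompt                  -> correct continuation
    ("x = [1, 2, [3, 4",         "]]"),  # nesting depth 2: close twice
    ("x = [1, 2, 3, 4",          "]"),   # nesting depth 1: close once
]

for prompt, target in examples:
    depth = prompt.count("[") - prompt.count("]")
    print(f"{prompt!r:24} depth={depth} -> predict {target!r}")
```

the circuit's whole job is deciding between those two continuations.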
in particular, the model has a residual channel (delta) that activates twice as strongly when you're inside a nested list. it does this by using attention to take the mean over a [ channel, so if the context has two [s the channel activates twice as strongly as with one. later on, the model thresholds this residual channel, only outputting ]] when the nesting-depth channel is at the stronger level.
but wait. the mean over a channel? doesn't that mean you can make the context longer and "dilute" the value, until it falls below the threshold? then, suddenly, the model will think it's only one level deep!
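to see the dilution numerically, here's a toy sketch: the attention pattern, channel values, and threshold are all invented, but it mimics the mechanism above (a depth signal computed as a mean over the context, compared against a fixed cutoff):

```python
# toy version of the nesting-depth channel: uniform attention takes the mean
# of an "is this token a [" indicator over the context, and a later threshold
# decides between ] and ]]. all numbers here are made up for illustration.

def depth_signal(tokens):
    # mean of the open-bracket indicator over the whole context
    return sum(t == "[" for t in tokens) / len(tokens)

# hypothetical cutoff, calibrated so typical-length depth-2 contexts land
# above it and depth-1 contexts land below it
THRESHOLD = 0.03

def predict_close(tokens):
    return "]]" if depth_signal(tokens) > THRESHOLD else "]"

base = ["[", "1", ",", "2", ",", "[", "3", ",", "4"]   # genuinely two levels deep
padded = base + [",", "5"] * 200                       # same nesting, much longer context

print(f"{depth_signal(base):.4f} -> {predict_close(base)}")      # 0.2222 -> ]]
print(f"{depth_signal(padded):.4f} -> {predict_close(padded)}")  # 0.0049 -> ]  (fooled)
```

padding keeps the true nesting depth at two but drags the mean below the cutoff, which is exactly the "dilution" the attack exploits.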
it turns out that indeed, this attack works really well on the entire sparse model (not just the circuit), and you can reliably trick it.
in retrospect, this failure is probably because extremely long nested lists are out of distribution for our specific pretraining dataset. but there's no way i would have come up with this attack by just thinking about the model.
one other worry is that maybe this is just some quirk of weight-sparse models. strikingly, it turns out that this attack also transfers to similarly capable dense models!
where can i find more info on how uvc compares to air purifiers? is it vastly more effective, or merely quieter? the website touches on it only very briefly.
I've heard good things about tiny-imagenet and fashion-mnist. even full imagenet is not that bad anymore with modern hardware.
i agree it's often a good idea to do experiments at small scale before big scale, in order to get tighter feedback loops, but i think mnist in particular is probably a bad task to start with. I think you probably want a dataset that's somewhat less trivial.
i don't think this argument has the right type signature to change the minds of the people who would be making this decision.
you could plausibly do this, and it would certainly reduce maintenance load a lot. every few years you will need to retire the old gpus and replace them with newer-generation ones, and that often breaks things or makes them horribly inefficient. also, you might occasionally have to change the container to patch critical security vulnerabilities.
both costs of serving lots of obsolete models seem pretty real. you either have to keep lots of ancient branches and unit tests around in your inference codebase that you have to support indefinitely, or fork your inference codebase into two codebases, both of which you have to support indefinitely. this slows down dev velocity and takes up bandwidth of people who are already backlogged on a zillion more revenue critical things. (the sad thing about software is that you can't just leave working things alone and assume they'll keep working... something else will change and break everything and then effort will be needed to get things back to working again.)
and to get non-garbage latency, it would also involve having a bunch of GPUs sit 99% idle to serve the models. if you're hosting one replica of every model you've ever released, this can soak up a lot of GPUs. it would be a small absolute % of all the GPUs used for inference, but people just aren't in the habit of allocating that many GPUs to something very few customers would care about. it's possible to be much more GPU efficient at the cost of latency, but getting that working well is a sizeable amount of engineering effort - weeks of your best engineers' time to set up, or months of a good engineer's time (and a neverending stream of maintenance)
so like in some sense neither of these are huge %s, but also you don't get to be a successful company by throwing away 5% here, 5% there.
I mean, even in the Felix Longoria Arlington case, which is what I assume you're referring to, it seems really hard for his staff members to have known, without the benefit of hindsight, that it was any significant window into his true beliefs? Johnson was famously good at working himself up into appearing to genuinely believe whatever was politically convenient at the moment, and here he briefly miscalculated the costs of supporting civil rights. his apparent genuineness in this case doesn't seem like strong evidence.
related to contrarianism: not invented here syndrome. I think rationalists have a strong tendency to want to reinvent things their own way, bulldozing Chesterton's fences, or just reinventing the wheel with rationalist flavor. good in moderation, bad in excess