Yeah, I was kind of rambling, sorry.
My main point is twofold (I'll just write GPU when I mean GPU / AI accelerator):
1. Destroying all GPUs is a stalling tactic, not a winning strategy. While CPUs are clearly much worse for AI than GPUs today, both CPUs and AI algorithms should keep improving over time. State-of-the-art models from less than ten years ago can be run on CPUs today with little loss in accuracy. If that trend continues, the GPU-vs-CPU gap only seems to be of short-term importance. Regarding your point about having to train a dense net on GPUs before s...
I realize that destroying all GPUs (or all AI accelerators in general) as a solution to AGI Doom is not realistically achievable, but I wonder whether it would be enough even if it were. It seems like the Lottery Ticket Hypothesis would likely foil this plan:
dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations.
Seeing how Neural Magic successfully sparsifies models to run on CPUs with minimal los...
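To make the sparsification point concrete: the standard baseline behind lottery-ticket-style subnetworks is global magnitude pruning, i.e. zeroing out the smallest-magnitude weights and keeping the rest. This is a toy sketch of that idea (flat weight list, no retraining step), not Neural Magic's actual pipeline:

```python
def magnitude_prune(weights, sparsity):
    """Toy global magnitude pruning: zero out the smallest-magnitude
    fraction `sparsity` of the weights.  Real pipelines do this on
    tensors and iterate prune -> retrain, but the selection rule is
    the same."""
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)
    # Keep every weight at or above the cutoff magnitude.
    threshold = flat[k] if k < len(flat) else float("inf")
    return [w if abs(w) >= threshold else 0.0 for w in weights]


# Prune half the weights: only the two largest-magnitude ones survive.
pruned = magnitude_prune([0.1, -0.5, 0.9, 0.05], 0.5)  # [0.0, -0.5, 0.9, 0.0]
```

The lottery-ticket claim is then that the surviving subnetwork, rewound to its original initialization and retrained in isolation, matches the dense network's accuracy.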
If you want to transfer the essence of Opus 3 into another model, the best way would likely be to make on-policy distillation (OPD) from Opus 3 part of the loss, starting with midtraining. The exact distribution over the top-100 or so tokens contains incredibly rich information about the inner life of a model, so this should work pretty well.
And if the tokenizer of your new model is different from that of Opus 3, you can probably just do a tokenizer transfer and some post-training on Opus 3 before the OPD.
(Yes, there's some "probably" and "likely" in there, but it's certainly the thing I'd try first.)
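The distillation term I have in mind is just a KL divergence against the teacher's top-k token distribution, renormalized over that slice. A minimal sketch (the function name and dict-based interface are made up for illustration; a real implementation would work on logit tensors):

```python
import math


def topk_distill_loss(teacher_topk, student_logprobs):
    """KL(teacher || student) restricted to the teacher's top-k tokens.

    teacher_topk:     dict token_id -> teacher probability (the top-100
                      slice; need not sum to 1).
    student_logprobs: dict token_id -> student log-probability.
    """
    # Renormalize the teacher's top-k slice so it forms a distribution.
    z = sum(teacher_topk.values())
    loss = 0.0
    for tok, p in teacher_topk.items():
        p = p / z
        loss += p * (math.log(p) - student_logprobs[tok])
    return loss
```

If the student exactly matches the teacher on those tokens the loss is zero, and it grows as the student's distribution drifts, which is exactly the signal you'd mix into the midtraining loss.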