FlexGen (https://github.com/Ying1123/FlexGen) is a way to run large (175B-parameter) LLMs on a single GPU at ~1 token/s. I think that puts it within reach of many hobbyists, and I predict we'll see an explosion of new capability research in the next few months.
I haven't had a chance to dig into the code, but presumably this could also be modified to allow local fine-tuning of the large models at a slow but potentially useful rate.
I'm curious whether any insights will make their way back to the large GPU clusters. From a cursory glance, there don't seem to be throughput or latency advantages there, unless weight compression can be used to run the entire model on fewer GPUs, e.g. by swapping layer weights in and out and caching layer outputs during batch inference.
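To make the "swap layer weights in and out, cache layer outputs" idea concrete, here is a toy sketch in plain Python. I haven't read the FlexGen code, so this is not its API: `load_layer`, `apply_layer` and `run_offloaded` are hypothetical names, and the "layers" are tiny matrices standing in for weights that would really live in CPU RAM or on disk.

```python
# Toy sketch of layer-by-layer batch inference with offloaded weights.
# All names here are illustrative assumptions, not FlexGen functions.

def load_layer(store, i):
    """Stand-in for copying layer i's weights from slow storage to the GPU."""
    return store[i]

def apply_layer(weights, x):
    """Toy layer: matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def run_offloaded(store, batch):
    """Load each layer once, run it over the whole batch, keep only the outputs."""
    activations = batch
    loads = 0
    for i in range(len(store)):
        weights = load_layer(store, i)  # one expensive transfer per layer
        loads += 1
        # The cached outputs of this layer are the inputs to the next one.
        activations = [apply_layer(weights, x) for x in activations]
        del weights  # this layer's memory can be reused for the next layer
    return activations, loads

store = [
    [[1.0, 0.0], [0.0, 1.0]],   # layer 0: identity
    [[2.0, 0.0], [0.0, -1.0]],  # layer 1: scale first input, negate second
]
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, loads = run_offloaded(store, batch)
print(outputs, loads)
```

The point of the sketch is the loop order: each layer is transferred once per batch rather than once per input, so the transfer cost is amortized over the batch. That helps throughput but does nothing for per-request latency.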
Are we into recursive self-improvement yet?
“AlphaEvolve began optimizing the lowest levels of hardware powering our AI stacks. It proposed a circuit design so counterintuitive yet efficient that it was integrated directly into the silicon of our next-generation TPUs. This is the latest example of TPU brains helping design next-generation TPU bodies.” (Jeff Dean, Chief Scientist, Google DeepMind and Google Research)
From https://deepmind.google/blog/alphaevolve-impact/
I have always thought of RSI as a speedup multiplier, and it sounds like this is greater than 1 for hardware as well as software now. Maybe 1.05 or 1.1?
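A quick back-of-the-envelope for what those numbers would mean. This assumes the multiplier applies once per hardware/software generation and stays constant, neither of which I have evidence for; `cumulative_speedup` is just an illustrative helper.

```python
def cumulative_speedup(multiplier, generations):
    """Total speedup if each generation is `multiplier` times faster than the last."""
    return multiplier ** generations

for m in (1.05, 1.1):
    # Assumed horizon of 10 generations, chosen only for illustration.
    print(m, round(cumulative_speedup(m, 10), 2))
```

Under those assumptions, 1.05 compounds to roughly 1.6x over ten generations and 1.1 to roughly 2.6x, so the difference between the two guesses matters more than it first looks.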
The practical impact I predict from AlphaEvolve's TPU work: the next order-of-magnitude training run(s) will start slightly sooner because they will be cheaper, even if not necessarily faster.