This is a special post for quick takes by Ben Livengood. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

https://github.com/Ying1123/FlexGen is a way to run large (175B-parameter) LLMs on a single GPU at ~1 token/s, which I think puts them within reach of many hobbyists. I predict we'll see an explosion of new capability research in the next few months.

I haven't had a chance to dig into the code, but presumably it could also be modified to allow local fine-tuning of large models at a slow but potentially useful rate.

I'm curious whether any insights will make their way back to the large GPU clusters. From my cursory glance there don't seem to be throughput or latency advantages unless weight compression can be used to run the entire model on fewer GPUs, e.g. by swapping layer weights in and out and caching layer outputs during batched inference.
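The core trick that makes single-GPU inference possible is offloading: keep the full set of layer weights in slow host memory (or disk) and stream one layer at a time into fast device memory, so peak fast-memory use is one layer rather than the whole model. Here's a minimal toy sketch of that idea in NumPy; this is my own illustration, not FlexGen's actual code, and the names (`host_weights`, `forward_offloaded`) are made up for the example.

```python
# Toy sketch of layer-wise weight offloading: all weights live in slow
# "host" storage; each forward pass copies one layer at a time into a
# (simulated) fast buffer, trading transfer time for memory.
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 8, 16

# "Host/disk" storage: every layer's weights stay here (large but slow).
host_weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

def forward_offloaded(x):
    """Run the model with only one layer's weights 'resident' at a time."""
    for layer in range(n_layers):
        w = host_weights[layer].copy()  # simulate the host -> GPU transfer
        x = np.tanh(x @ w)              # compute with just this layer loaded
    return x

def forward_resident(x):
    """Reference: the usual pass with all weights resident at once."""
    for w in host_weights:
        x = np.tanh(x @ w)
    return x

x = rng.standard_normal((4, d))  # a small batch of activations
assert np.allclose(forward_offloaded(x), forward_resident(x))
```

The output is identical either way; the cost is the per-layer copy, which is why throughput stays around ~1 token/s and why larger batches (amortizing each weight transfer over more tokens) are the natural optimization.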