Ben Livengood's Shortform

by Ben Livengood
20th Feb 2023
FlexGen (https://github.com/Ying1123/FlexGen) is a way to run large (175B-parameter) LLMs on a single GPU at ~1 token/s, which I think puts them within reach of many hobbyists, and I predict we'll see an explosion of new capability research in the next few months.

I haven't had a chance to dig into the code, but presumably it could also be modified to allow local fine-tuning of large models at a slow but potentially useful rate.

I'm curious whether any insights will make their way back to the large GPU clusters. From my cursory glance there don't seem to be throughput or latency advantages, unless weight compression can be used to run the entire model on fewer GPUs, e.g. by swapping layer weights in and out and caching layer outputs during batch inference.
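To make the swapping idea concrete, here is a minimal toy sketch (not FlexGen's actual code) of layer-at-a-time weight offloading: only one layer's weights occupy "GPU" memory at once, and a whole batch is pushed through each layer before its weights are evicted. The model, dimensions, and `load_to_gpu` helper are all hypothetical stand-ins; in a real system the load would be a host-to-device (or disk-to-host) copy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: each layer's weights live "offloaded" (a dict standing in for
# CPU RAM or disk). Only one layer's weights are ever resident on the "GPU".
N_LAYERS, DIM = 4, 8
offloaded_weights = {
    i: rng.standard_normal((DIM, DIM)) / np.sqrt(DIM) for i in range(N_LAYERS)
}

def load_to_gpu(layer_idx):
    # Hypothetical helper: in a real offloading engine this is a
    # host-to-device transfer; here it is just a dict lookup.
    return offloaded_weights[layer_idx]

def run_model(x):
    # Layer-at-a-time inference: swap one layer's weights in, run the entire
    # batch through it, then free them. This trades transfer bandwidth for
    # memory capacity, which is why large batches amortize the swap cost.
    for i in range(N_LAYERS):
        w = load_to_gpu(i)       # swap this layer's weights in
        x = np.tanh(x @ w)       # process the whole batch with this layer
        del w                    # evict, freeing "GPU" memory for the next layer
    return x

batch = rng.standard_normal((16, DIM))  # a large batch of cached activations
out = run_model(batch)
print(out.shape)  # (16, 8)
```

The design point this illustrates is the one speculated about above: peak weight memory is one layer rather than the whole model, at the cost of re-transferring every layer per forward pass, so it helps capacity-constrained single-GPU setups far more than throughput-optimized clusters.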
