The goal of this post is to show you how easy it is to load the Llama 405B model on RunPod, and also how costly it can get if you don't know a few things in advance. I hope this post helps you save those precious GPU minutes!
First, the Llama 405B model is huge:

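Just how huge can be made concrete with back-of-the-envelope math: a sketch of the weight footprint at common precisions, counting weights only (KV cache and activations need extra headroom on top):

```python
# Rough size of Llama 405B weights at common precisions.
# Weights only -- KV cache and activations need extra headroom on top.
PARAMS = 405e9

BYTES_PER_PARAM = {
    "FP16/BF16": 2,
    "FP8": 1,
    "INT4": 0.5,
}

for precision, nbytes in BYTES_PER_PARAM.items():
    size_gb = PARAMS * nbytes / 1e9
    print(f"{precision}: ~{size_gb:.0f} GB")
# FP16/BF16: ~810 GB; FP8: ~405 GB; INT4: ~202 GB
```

At full BF16 precision the weights alone are around 810 GB, which is why the choice of precision and GPU matters so much here.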
Let’s talk GPU!
You need the right GPU and enough disk space to store the model parameters, plus some overhead for your experiments, such as running inference and saving internal activations.
Some good options are H100, A100 and H200 machines:

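To compare these options, here is a rough sketch of how many GPUs you'd need to fit the model. The VRAM figures (80 GB A100/H100, 141 GB H200) are standard specs, but the 20% headroom factor for KV cache and activations is my own assumption; your workload may need more:

```python
import math

# Rough GPU-count estimate: model weights at a given precision plus ~20%
# headroom (an assumption) for KV cache and activations, divided by per-GPU VRAM.
GPU_VRAM_GB = {"A100": 80, "H100": 80, "H200": 141}

def gpus_needed(model_gb: float, vram_gb: float, headroom: float = 1.2) -> int:
    return math.ceil(model_gb * headroom / vram_gb)

for gpu, vram in GPU_VRAM_GB.items():
    fp16 = gpus_needed(810, vram)  # BF16/FP16 weights: ~810 GB
    fp8 = gpus_needed(405, vram)   # FP8 weights: ~405 GB
    print(f"{gpu}: ~{fp16} GPUs at FP16, ~{fp8} at FP8")
# A100/H100: ~13 GPUs at FP16, ~7 at FP8; H200: ~7 at FP16, ~4 at FP8
```

By this estimate, a single 8x H200 node comfortably fits the FP8 model, while 80 GB cards need more than one node at FP16.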
The H200 currently wins on both cost and inference speed. With FP8 quantization and higher VRAM per GPU, you need fewer GPUs to fit the model.