New LLM Scaling Law
Hi all, I'm an independent researcher, and I believe I've found a new scaling law for Mixture of Experts (MoE) models. I'd appreciate any review and critique. It challenges the notion that performant training and inference require holding all weights in VRAM, and suggests that as long as bus speeds...