Evals in the Age of Jarvis
The last couple of years gave us a fairly clean story around LLM pretraining: scale up the data, scale up the compute, use next-token prediction as a universal loss, and watch as a kind of implicit multi-task learning emerges. The evals ecosystem followed this arc: benchmark suites for reasoning,...