This work was supported through the MARS (Mentorship for Alignment Research Students) program at the Cambridge AI Safety Hub (caish.org/mars).
Why We Built inspect_wandb
Evaluating frontier AI models can be messy; each benchmark has its own quirks, scripts, and formats, and even after a successful run, the results usually remain buried in local JSON files. That’s fine for a solo experiment, but painful when you need to compare models, audit results, or reproduce someone else’s work.
Inspect is an open-source framework for running evaluations locally on your own machine. Weights & Biases (WandB) is a platform for sharing, tracking, and visualizing ML experiments and LLM evals in the cloud. Our integration, inspect_wandb, is the missing bridge: it takes the results of local Inspect runs and logs them to WandB, so they can be shared, tracked, and visualized alongside the rest of your work.
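To make the workflow concrete, here is a minimal sketch of the Inspect side of things. The task definition and `eval()` call below use Inspect's public API; the assumption (for illustration only) is that with inspect_wandb installed and configured, the same run would also be logged to a WandB project rather than living only in local log files.

```python
# Minimal Inspect evaluation. The task/eval API is Inspect's own;
# the automatic WandB logging is an assumption about how inspect_wandb
# hooks into the run, shown here only to illustrate the intended flow.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought, generate

@task
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        solver=[chain_of_thought(), generate()],
        scorer=model_graded_fact(),
    )

if __name__ == "__main__":
    # Runs the evaluation locally; with the integration in place, the
    # results would additionally appear in your WandB workspace.
    eval(theory_of_mind(), model="openai/gpt-4o")
```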