This work was supported through the MARS (Mentorship for Alignment Research Students) program at the Cambridge AI Safety Hub (caish.org/mars).
Why We Built inspect_wandb
Evaluating frontier AI models can be messy; each benchmark has its own quirks, scripts, and formats, and even after a successful run, the results usually remain buried in local JSON files. That’s fine for a solo experiment, but painful when you need to compare models, audit results, or reproduce someone else’s work.
Inspect is an open-source framework for running evaluations locally on your own machine. Weights & Biases (WandB) is a platform for sharing, tracking, and visualizing ML experiments and LLM evals in the cloud. Our integration, inspect_wandb, is the missing bridge: it takes the results of local Inspect runs and logs them to WandB, so they can be shared, tracked, and visualized alongside the rest of your work.
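To make the workflow concrete, here is a minimal sketch of the Inspect side of things. The task definition and `eval()` call below use Inspect's public API; the assumption (for illustration only) is that with inspect_wandb installed and configured, the same run would also be logged to a WandB project rather than living only in local log files.

```python
# Minimal Inspect evaluation. The task/eval API is Inspect's own;
# the automatic WandB logging is an assumption about how inspect_wandb
# hooks into the run, shown here only to illustrate the intended flow.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought, generate

@task
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        solver=[chain_of_thought(), generate()],
        scorer=model_graded_fact(),
    )

if __name__ == "__main__":
    # Runs the evaluation locally; with the integration in place, the
    # results would additionally appear in your WandB workspace.
    eval(theory_of_mind(), model="openai/gpt-4o")
```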