[Linkpost] Evals for “SPI-incompatible” behavior & reasoning: Guide to initial research

Anthony DiGiovanni

Safe Pareto Improvements


9th Jun 2026


This is a linkpost for https://docs.google.com/document/d/12U301yaKLgb_gY1Lg1e5z9BiGAcpA4a06-gLBIIleZ0/edit?tab=t.eyc1v33vuprh

In Part I of CLR's safe Pareto improvements (SPI) agenda, we outlined our high-level strategy for evaluating models for SPI-incompatible behavior and reasoning. This guide goes into more detail on how I’m thinking about executing that strategy, especially:

the kind of workflow I think we should use, to start out;
next steps building on what I’ve tried so far; and
my rough sense of what counts as unambiguously bad “SPI-incompatibilities”.

If you’re interested in collaborating on the next steps, please get in touch! I’d be happy to flesh things out further and to invite you to the private git repo.

