In Part I of CLR's safe Pareto improvements (SPI) agenda, we gave our high-level strategy for evaluating models for SPI-incompatible behavior and reasoning. This guide gives more details on how I’m thinking about executing on this strategy, especially:
the kind of workflow I think we should use to start out;
next steps building on what I’ve tried so far; and
my rough sense of what counts as unambiguously bad “SPI-incompatibilities”.
If you’re interested in collaborating on the next steps, please get in touch! I’d be happy to flesh things out more and invite you to the private git repo.