Predicting Rare LLM Failures with 30× Fewer Rollouts
by Santiago Aranguri and Francisco Pernice
TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space.

Authors: Francisco Pernice (MIT), Santiago Aranguri (Goodfire)

Introduction

A harmful behavior that occurs once in...
May 1355