Francisco Pernice — LessWrong

Predicting Rare LLM Failures with 30× Fewer Rollouts

by Santiago Aranguri and Francisco Pernice

TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space.

Authors: Francisco Pernice (MIT), Santiago Aranguri (Goodfire)

Introduction

A harmful behavior that occurs once in...
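As a rough illustration of what "interpolating in logit space" in the TL;DR could mean, here is a minimal sketch. It assumes a simple linear mix of next-token logits from the base model and the less-safe variant, controlled by a hypothetical parameter `alpha`. The function names and toy logits are made up for illustration, and the excerpt does not say how the authors turn interpolated models into a rate estimate, so this shows only the mixing step.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_logits(base_logits, unsafe_logits, alpha):
    """Linear mix in logit space (an assumed form, not confirmed by the post).

    alpha = 0 recovers the base model, alpha = 1 the less-safe variant.
    """
    return [(1 - alpha) * b + alpha * u
            for b, u in zip(base_logits, unsafe_logits)]


# Hypothetical next-token logits over a 3-token vocabulary.
# Token index 2 stands in for the start of a "harmful" continuation.
base_logits = [4.0, 1.0, -6.0]
unsafe_logits = [1.0, 1.0, 2.0]

for alpha in (0.0, 0.5, 1.0):
    probs = softmax(interpolate_logits(base_logits, unsafe_logits, alpha))
    print(f"alpha={alpha:.1f}  P(harmful token)={probs[2]:.6f}")
```

In this toy setup the harmful token becomes more likely as `alpha` moves toward the less-safe variant, so it shows up in far fewer samples than it would under the base model alone.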