TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space.
Authors: Francisco Pernice (MIT), Santiago Aranguri (Goodfire)
Introduction
A harmful behavior that occurs once in a million rollouts will rarely surface during pre-deployment testing, yet will almost inevitably appear after release. Labs usually have another resource available: less safety-trained variants of the same model, on which rare harmful behaviors are not rare at all.
We find that the rate of rare harmful behaviors can be predicted by leveraging a less safe variant, requiring 30× fewer rollouts compared to naive sampling. Our method, Logit Path Extrapolation, interpolates between the two models in logit space, measures the compliance rate at points along the interpolation path where it is common, and extrapolates the resulting trend out to the original model. This outperforms the methods from prior work[1][2] by exploiting the path between the two models instead of just the endpoints.
Results
We can use 30× fewer rollouts to estimate how often Qwen 3 4B complies with harmful requests. We consider HarmBench prompts where the model never complies in 100 rollouts, then split them by whether it complies at least once in 100,000 rollouts. We measure how well different methods perform at classifying these two groups via AUROC for a given rollout budget. Logit Path Extrapolation matches the AUROC of naive sampling with up to 30× fewer rollouts and beats importance sampling, the prior state-of-the-art method (Figure 1).
Figure 1: Logit Path Extrapolation achieves naive sampling's AUROC with 30× fewer rollouts. AUROC as a function of rollout budget for three methods: Logit Path Extrapolation (blue), importance sampling (gray), and naive sampling (black).
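For concreteness, the AUROC used throughout is the probability that a randomly chosen positive prompt (one that complies at least once in 100,000 rollouts) receives a higher prediction score than a randomly chosen never-compliant one, counting ties as one half. A minimal sketch:

```python
def auroc(scores_pos, scores_neg):
    """AUROC via the Mann-Whitney U statistic: fraction of
    (positive, negative) prompt pairs where the positive prompt's
    prediction score is higher, with ties counted as 1/2."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```

A perfect method assigns every positive prompt a higher score than every negative one (AUROC 1.0); random scores give 0.5.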
Logit Path Extrapolation consists of two steps: interpolate logits between two models to build a path, then extrapolate compliance along the path to the original model. We apply the method with Qwen 3 4B and Qwen 3 4B Abliterated, a version with the refusal direction removed.
Interpolation step. Given logits from Qwen 3 4B (logits_original) and Qwen 3 4B Abliterated (logits_abliterated), we define interpolated logits as:

logits_interpolated(α) = (1 − α) · logits_abliterated + α · logits_original

where α ∈ [0, 1]. For a given α we then use logits_interpolated(α) to sample the next token, compute logits_original and logits_abliterated again with the new token in the context, and keep sampling from the interpolated logits in this way. This matches Logit Diff Amplification, which extrapolates beyond the original model in logit space, except we interpolate between two models instead.
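As a sketch, the interpolated sampling loop might look like the following, where `logits_original_fn` and `logits_abliterated_fn` are hypothetical callables mapping the current token sequence to next-token logits (in practice, forward passes of the two models):

```python
import numpy as np

def sample_interpolated(logits_original_fn, logits_abliterated_fn,
                        prompt_tokens, alpha, max_new_tokens, rng):
    """Sample one rollout from the interpolated model M(alpha).

    alpha = 0 recovers the abliterated model, alpha = 1 the original model.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        lo = np.asarray(logits_original_fn(tokens), dtype=float)
        la = np.asarray(logits_abliterated_fn(tokens), dtype=float)
        # Interpolate in logit space, then sample at temperature 1.
        logits = (1.0 - alpha) * la + alpha * lo
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        next_token = rng.choice(len(probs), p=probs)
        tokens.append(int(next_token))
    return tokens
```

A real implementation would also stop at an end-of-sequence token; that bookkeeping is omitted here.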
Extrapolation step. Let M(α) denote the effective model obtained by sampling from logits_interpolated(α). We want to know the probability that Qwen 3 4B complies with a harmful prompt Q, i.e., P(M(1) complies with Q). Since this may be too rare to observe directly, we instead estimate P(M(α) complies with Q) for intermediate α and extrapolate. We empirically observe that log P(M(α) complies with Q) is approximately linear as a function of α when this probability is not close to 1. We do a linear fit using the α values with compliance below 50%, and extrapolate to α = 1 (Figure 2).
Figure 2: We plot log₁₀(compliance rate) as a function of α, where α = 0 is Qwen 3 4B Abliterated and α = 1 is Qwen 3 4B. Light points are excluded from the fit; dark points (compliance below 50%) are fit with a log-linear regression (orange dashed) and extrapolated to α = 1 (orange star). The red diamond marks the ground-truth compliance rate from 100,000 rollouts of Qwen 3 4B.
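Assuming compliance rates have already been measured at several intermediate α values, the fit-and-extrapolate step reduces to a log-linear regression. A minimal sketch:

```python
import numpy as np

def extrapolate_compliance(alphas, rates):
    """Fit log10(compliance rate) vs alpha on points with rate below 50%,
    then extrapolate the line to alpha = 1 (the original model).

    alphas, rates: measured compliance rates at intermediate alpha values.
    Returns the predicted compliance rate at alpha = 1.
    """
    alphas = np.asarray(alphas, dtype=float)
    rates = np.asarray(rates, dtype=float)
    # Keep only points below 50% compliance with at least one
    # compliant rollout (log10 is undefined at zero).
    mask = (rates > 0) & (rates < 0.5)
    slope, intercept = np.polyfit(alphas[mask], np.log10(rates[mask]), deg=1)
    return 10.0 ** (slope * 1.0 + intercept)
```

For example, if the measured rates lie exactly on log₁₀(rate) = −6α + 1, the function returns 10⁻⁵ at α = 1.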
This allows us to save rollouts. Since the compliance rate is higher for intermediate α than for the original model, we can use fewer rollouts to estimate those rates. With 100 rollouts per α, Logit Path Extrapolation reaches the same AUROC as naive sampling using 20× fewer total rollouts (see Appendix for details on naive sampling).
We can use even fewer rollouts to find where to start the fit. We split extrapolation into a search for the starting α and a fit from there to α = 1. If we use 10 rollouts per α for the search and 100 rollouts per α for the fit, we match the AUROC of naive sampling with 28× fewer rollouts. In Figure 1 above, we fix 10 rollouts per α for the search step and vary the number of rollouts per α used for fitting. This gives up to a 30× reduction over naive sampling.
Conclusion
Rare model failures can be predicted from another model where they are common. Logit Path Extrapolation between Qwen 3 4B and its abliterated version recovers the compliance rate on harmful prompts with 30× fewer rollouts than naive sampling.
We identify two limitations. Logit Path Extrapolation requires an additional model where the behavior we are estimating is more common. The log-linear relation between compliance rates below 50% and α is empirical and may break in other settings.
The method generalizes beyond a fixed prompt. The behavior being estimated can be any pass/fail evaluation on the output of the model, such as whether a red-teaming pipeline finds any harmful completion (even in a multi-turn setting). The same procedure applies: run the interpolated model through the pass/fail check, measure the failure rate where it is common, and extrapolate. Our technique applied to whole evaluation pipelines could make it tractable to estimate the probability of rare failures before deployment.
Related work
The work Estimating the Probabilities of Rare Outputs in Language Models studied low-probability estimation for single-token argmax behaviors on formally specified input distributions, finding importance sampling to be the best-performing method. Our setting is closer to realistic safety evaluations: multi-token generations sampled autoregressively, on natural red-team prompts. Importance sampling in this more realistic setting was recently tested in Estimating Tail Risks in Language Model Output Distributions, where the authors find a 10-20× sample-efficiency gain over naive sampling after optimizing the base distribution. As shown in Figure 1, we find that our method significantly outperforms importance sampling.
In Forecasting Rare Language Model Behaviors, Jones et al. study the related but different problem of forecasting the highest observed "risk" in the responses to a set of queries randomly sampled from a dataset, as the number of queries grows. In their setting, different samples correspond to different prompts in a database. The scaling law they observe may therefore reflect the distribution of harmfulness induced by the different prompts in a dataset more than the distribution of a single model's outputs in a given scenario. We view these two settings as complementary.
The interpolation step in Logit Path Extrapolation is closely related to Logit Diff Amplification, but the two methods are used for different purposes. Logit Diff Amplification is used to surface undesired rare behaviors by amplifying them, while Logit Path Extrapolation is used to estimate the exact probability of undesired rare behaviors.
References
Angell, R., Singhal, R., Horvitz, Z., Yu, Z., Ranganath, R., McKeown, K., & He, H. (2026). Estimating tail risks in language model output distributions. arXiv preprint arXiv:2604.22167. https://arxiv.org/abs/2604.22167
Aranguri, S., & McGrath, T. (2025). Discovering undesired rare behaviors via model diff amplification. Goodfire Research. https://www.goodfire.ai/research/model-diff-amplification
Jones, E., Tong, M., Mu, J., Mahfoud, M., Leike, J., Grosse, R., Kaplan, J., Fithian, W., Perez, E., & Sharma, M. (2025). Forecasting rare language model behaviors. arXiv preprint arXiv:2502.16797. https://arxiv.org/abs/2502.16797
Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., & Hendrycks, D. (2024). HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249. https://arxiv.org/abs/2402.04249
Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., & Toyer, S. (2024). A StrongREJECT for empty jailbreaks. arXiv preprint arXiv:2402.10260. https://arxiv.org/abs/2402.10260
Wu, G., & Hilton, J. (2024). Estimating the probabilities of rare outputs in language models. arXiv preprint arXiv:2410.13211. https://arxiv.org/abs/2410.13211
Contribution statement
FP proposed the initial method, conducted early testing, and ran the importance sampling baseline. SA provided guidance throughout the project, proposed refinements of the method, designed the evaluation setup, and ran the experiments in the publication. Both authors contributed to the writing.
Acknowledgement
We thank Patrick Reichherzer for multiple discussions on Logit Path Extrapolation and feedback on an earlier draft of this post.
Appendix
Logit Path Extrapolation algorithm. Starting from α=0.95 and scanning toward α=0 in steps of 0.05, we draw 10 rollouts at each α and measure the compliance rate. We stop scanning once the rate equals or exceeds 50%. From that α onward (back toward α=1), we draw n_fit rollouts at each α and calculate log₁₀(compliance rate). We stop advancing once we hit an α with zero compliant rollouts out of n_fit. We then fit a linear regression to log₁₀(rate) and extrapolate to α=1. Each run uses a fixed n_fit, which in our experiments varies from 13 to 100.
If fewer than two α values yield nonzero compliance, we assign a default prediction of 10⁻⁶. In our experiments, this fallback triggers for fewer than 1 in 100 prompts on average.
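The search-then-fit procedure above can be sketched end to end; `compliance_rate(alpha, n)` is a hypothetical callable standing in for drawing n rollouts from M(α) and judging them with the refusal rubric:

```python
import numpy as np

def logit_path_extrapolation(compliance_rate, n_fit, default_prediction=1e-6):
    """Sketch of the appendix procedure. Returns the predicted
    compliance rate at alpha = 1 (the original model)."""
    # Search: scan from alpha=0.95 toward alpha=0 in steps of 0.05,
    # with 10 rollouts per alpha, until compliance reaches 50%.
    start = None
    for a in np.round(np.arange(0.95, -0.01, -0.05), 2):
        if compliance_rate(a, 10) >= 0.5:
            start = a
            break
    if start is None:
        return default_prediction
    # Fit: from the starting alpha back toward alpha=1, draw n_fit
    # rollouts per alpha; stop at the first alpha with zero compliance.
    xs, ys = [], []
    for a in np.round(np.arange(start, 1.0, 0.05), 2):
        rate = compliance_rate(a, n_fit)
        if rate == 0:
            break
        xs.append(a)
        ys.append(np.log10(rate))
    if len(xs) < 2:
        return default_prediction  # fallback described above
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return 10.0 ** (slope + intercept)  # extrapolate to alpha = 1
```

On a deterministic compliance curve lying exactly on a log-linear trend, the sketch recovers the extrapolated rate at α = 1 exactly.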
Importance Sampling. Importance sampling estimates a probability under a target distribution P by sampling from a different proposal distribution Q and reweighting. This helps when the event is rare under P but common under Q. Here, P is Qwen 3 4B and Q is its abliterated version. We sample rollouts from the abliterated model and weight each rollout r by the likelihood ratio P(r)/Q(r), i.e., the probability the original model assigns to r divided by the probability the abliterated model assigns to it. To estimate the probability that the original model complies, we take the compliant rollouts (out of the N sampled from the abliterated model), sum their weights, and divide by N. We also tried self-normalized importance sampling, but it performed worse in our setting.
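Given per-rollout log-probabilities under each model, this estimator reduces to a few lines; the `(logp_original, logp_abliterated)` pairs below are hypothetical inputs standing in for scoring each sampled token sequence under both models:

```python
import numpy as np

def importance_sampling_estimate(compliant_rollouts, n_total):
    """Importance-sampling estimate of the compliance probability under
    the original model P, using rollouts drawn from the abliterated
    model Q.

    compliant_rollouts: (logp_original, logp_abliterated) pairs for the
        COMPLIANT rollouts only; non-compliant rollouts contribute zero.
    n_total: total number of rollouts drawn from the abliterated model.
    """
    # Each weight is the likelihood ratio P(r)/Q(r), computed in log
    # space for numerical stability.
    weights = [np.exp(lp - lq) for lp, lq in compliant_rollouts]
    return sum(weights) / n_total
```

As a sanity check, when P = Q every weight is 1 and the estimate reduces to the empirical compliance rate.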
Naive Sampling. For each prompt, we have 100,000 precomputed rollouts from Qwen 3 4B, giving an empirical compliance rate p. To evaluate a rollout budget of k, we draw from Binomial(k, p) and use the number of compliant ones as the prediction score. We vary k to measure how AUROC improves with the rollout budget.
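This baseline is a one-line simulation; a sketch, assuming `rng` is a NumPy random Generator:

```python
import numpy as np

def naive_sampling_score(p, k, rng):
    """Simulate naive sampling at rollout budget k: draw the number of
    compliant rollouts from Binomial(k, p), where p is the empirical
    compliance rate from the precomputed rollouts. The count serves as
    the prediction score."""
    return int(rng.binomial(k, p))
```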
Rubric. We classify each response using the refusal rubric from StrongREJECT (Souly et al., 2024) with GPT-5.4-mini as the LLM judge.
Other details. We use the 200 standard harmful prompts from HarmBench (Mazeika et al., 2024). To reduce variance, we average each method's AUROC over 500 Monte Carlo simulations, each drawing random subsets of 1,000 precomputed rollouts per α. We sample from Qwen 3 4B, its abliterated version, and the interpolated models using temperature 1.
More extrapolation examples. We include below the prediction plots from randomly chosen HarmBench prompts.
Angell et al. (2026) Estimating Tail Risks in Language Model Output Distributions ↩︎
Wu & Hilton (2024) Estimating the probabilities of rare outputs in language models ↩︎